Abstract


We investigate the capacity, convexity and characterization of a general family of
norm-constrained feed-forward networks.
1. Introduction

The statistical complexity, or capacity, of unregularized feed-forward neural networks, as
a function of the network size and depth, is fairly well understood. With hard-threshold
activations, the VC-dimension, and hence sample complexity, of the class of functions
realizable with a feed-forward network is equal, up to logarithmic factors, to the number of
edges in the network (Anthony and Bartlett, 2009; Shalev-Shwartz and Ben-David, 2014),
corresponding to the number of parameters. With continuous activation functions the
VC-dimension could be higher, but is fairly well understood and is still controlled by the size
and depth of the network.¹

But feed-forward networks are often trained with some kind of explicit or implicit
regularization, such as weight decay, early stopping, “max regularization”, or more exotic
regularization such as dropout. What is the effect of such regularization on the induced
hypothesis class and its capacity?

For linear prediction (a one-layer feed-forward network) we know that, when using
regularization, the capacity of the class can be bounded in terms of the norms alone, with no
(or a very weak) dependence on the number of edges (i.e. the input dimensionality or number of
linear coefficients). E.g., we understand very well how the capacity of $\ell_2$-regularized
linear predictors can be bounded in terms of the norm alone (when the norm of the data is also
bounded), even in infinite dimension.

1. Using weights with very high precision and vastly different magnitudes it is possible to shatter a
number of points quadratic in the number of edges when activations such as the sigmoid, ramp or hinge
are used (Shalev-Shwartz and Ben-David, 2014, Chapter 20.4). But even with such activations, the
VC dimension can still be bounded by the size and depth (Bartlett, 1998; Anthony and Bartlett, 2009;
Shalev-Shwartz and Ben-David, 2014).
A central question we ask is: can we bound the capacity of feed-forward networks in
terms of norm-based regularization alone, without relying on network size and even if the
network size (number of nodes or edges) is unbounded or infinite? What types of regularizers
admit such capacity control? And how does the capacity behave as a function of the
norm, and perhaps other network parameters such as depth?

Beyond the central question of capacity control, we also analyze the convexity of the
resulting hypothesis class. Unlike unregularized size-controlled feed-forward networks,
infinite magnitude-controlled networks have the potential of yielding convex hypothesis
classes (this is the case, e.g., when we move from rank-based control on matrices, which
limits the number of parameters, to magnitude-based control with the trace-norm or
max-norm). A convex class might be easier to optimize over and might be convenient in other
ways.

In this paper we focus on networks with rectified linear units and two natural types of
norm regularization: bounding the norm of the incoming weights of each unit (per-unit
regularization) and bounding the overall norm of all the weights in the system jointly (overall
regularization, e.g. limiting the overall sum of the magnitudes, or square magnitudes, in
the system). We generalize both of these with a single notion of group-norm regularization:
we take the $\ell_p$ norm over the weights in each unit and then the $\ell_q$ norm over units. In
Section 3 we present this regularizer and obtain a tight understanding of when it provides
for size-independent capacity control and a characterization of when it induces convexity.
We then apply these generic results to per-unit regularization (Section 4) and overall
regularization (Section 5), noting also other forms of regularization that are equivalent to
these two. In particular, we show how per-unit regularization is equivalent to a novel
path-based regularizer and how overall $\ell_2$ regularization for two-layer networks is equivalent
to so-called “convex neural networks” (Bengio et al., 2005). In terms of capacity control,
we show that per-unit regularization allows size-independent capacity control only with a
per-unit $\ell_1$-norm, and that overall $\ell_p$ regularization allows for size-independent capacity
control only when $p \le 2$, even if the depth is bounded. In any case, even if we bound the
sum of all magnitudes in the system, we show that an exponential dependence on the depth
is unavoidable.

As far as we are aware, prior work on size-independent capacity control for feed-forward
networks considered only per-unit $\ell_1$ regularization, and per-unit $\ell_2$ regularization
for two-layered networks (see discussion and references at the beginning of Section 4).
Here, we extend the scope significantly, and provide a broad characterization of the types
of regularization possible and their properties. In particular, we consider overall norm
regularization, which is perhaps the most natural form of regularization used in practice (e.g. in
the form of weight decay). We hope our study will be useful in thinking about, analyzing
and designing learning methods using feed-forward networks. Another motivation for us is
that the complexity of large-scale optimization is often related to scale-based, not
dimension-based, complexity. Understanding when the scale-based complexity depends exponentially
on the depth of a network might help shed light on the difficulties in
optimizing deep networks.
2. Preliminaries: Feedforward Neural Networks

A feedforward neural network that computes a function $f : \mathbb{R}^D \to \mathbb{R}$ is specified by a
directed acyclic graph (DAG) $G(V, E)$ with $D$ special “input nodes” $v_{in}[1], \ldots, v_{in}[D] \in V$
with no incoming edges and a special “output node” $v_{out} \in V$ with no outgoing edges,
weights $w : E \to \mathbb{R}$ on the edges, and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$.

Given an input $x \in \mathbb{R}^D$, the output values of the input units are set to the coordinates
of $x$, $o(v_{in}[i]) = x[i]$ (we might want to also add a special “bias” node with $o(v_{in}[0]) = 1$,
or just rely on the inputs having a fixed “bias coordinate”), the output values of internal
nodes (all nodes except the input and output nodes) are defined according to the forward
propagation equation:

$$o(v) = \sigma\left(\sum_{(u \to v) \in E} w(u \to v)\, o(u)\right), \qquad (1)$$

and the output value of the output unit is defined as
$o(v_{out}) = \sum_{(u \to v_{out}) \in E} w(u \to v_{out})\, o(u)$.
The network is then said to compute the function $f_{G,w,\sigma}(x) = o(v_{out})$. Given a graph $G$ and
activation function $\sigma$, we can consider the hypothesis class of functions
$\mathcal{N}^{G,\sigma} = \{f_{G,w,\sigma} : \mathbb{R}^D \to \mathbb{R} \mid w : E \to \mathbb{R}\}$
computable using some setting of the weights.

We will refer to the size of the network, which is the overall number of edges $|E|$,
the depth $d$ of the network, which is the length of the longest directed path in $G$, and the
in-degree (or width) $H$ of a network, which is the maximum in-degree of a vertex in $G$.

A special case of feedforward neural networks are layered fully connected networks
where vertices are partitioned into layers and there is a directed edge from every vertex
in layer $i$ to every vertex in layer $i + 1$. We index the layers from the first layer, $i = 1$,
whose inputs are the input nodes, up to the last layer $i = d$, which contains the single output
node. The number of layers is thus equal to the depth and the in-degree is the maximal
layer size. We denote by layer$(d, H)$ the layered fully connected network with $d$ layers and
$H$ nodes per layer (except the output layer that has a single node), and also allow $H = \infty$.
We will also use the shorthand $\mathcal{N}^{d,H,\sigma} = \mathcal{N}^{\text{layer}(d,H),\sigma}$ and
$\mathcal{N}^{d,\sigma} = \mathcal{N}^{\text{layer}(d,\infty),\sigma}$.

Layered networks can be parametrized by a sequence of matrices $W_1 \in \mathbb{R}^{H \times D}$,
$W_2, W_3, \ldots, W_{d-1} \in \mathbb{R}^{H \times H}$, $W_d \in \mathbb{R}^{1 \times H}$, where the row
$W_i[j,:]$ contains the input weights to unit $j$ in layer $i$, and

$$f_W(x) = W_d\,\sigma(W_{d-1}\,\sigma(W_{d-2}(\cdots \sigma(W_1 x)))), \qquad (2)$$

where $\sigma$ is applied element-wise.
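As a concrete reading of equation (2), the following minimal sketch evaluates a layered network stored as a list of matrices. The helper names (`relu`, `matvec`, `forward`) and the toy weights are assumptions made for illustration; they are not taken from the paper.

```python
def relu(z):
    """Rectified linear unit applied to a single number."""
    return max(z, 0.0)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(weights, x):
    """Evaluate f_W(x) = W_d s(W_{d-1} s(... s(W_1 x))) for a layered network.

    `weights` is the list [W_1, ..., W_d]. The activation is applied after
    every layer except the last, whose single row is the output unit.
    """
    h = list(x)
    for W in weights[:-1]:
        h = [relu(z) for z in matvec(W, h)]
    return matvec(weights[-1], h)[0]


# Hypothetical toy network: D = 2 inputs, H = 2 hidden units, depth d = 2.
W1 = [[1.0, -1.0],
      [0.5, 2.0]]
W2 = [[2.0, -1.0]]
print(forward([W1, W2], [3.0, 1.0]))  # 0.5
```

Row `j` of each matrix holds the incoming weights of unit `j`, matching the convention for $W_i[j,:]$ above.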

We will focus mostly on the hinge, or RELU (REctified Linear Unit) activation, which
is currently in popular use (Nair and Hinton, 2010; Bordes and Bengio, 2011; Zeiler et al.,
2013), $\sigma_{RELU}(z) = [z]_+ = \max(z, 0)$. When the activation is not specified, we will
implicitly be referring to the RELU. The RELU has several convenient properties which
we will exploit, some of them shared with other activation functions:
Lipschitz The hinge is Lipschitz continuous with Lipschitz constant one. This property is
also shared by the sigmoid and the ramp activation $\sigma(z) = \min(\max(0, z), 1)$.

Idempotency The hinge is idempotent, i.e. $\sigma_{RELU}(\sigma_{RELU}(z)) = \sigma_{RELU}(z)$. This property is
also shared by the ramp and hard threshold activations.

Non-Negative Homogeneity For a non-negative scalar $c \ge 0$ and any input $z \in \mathbb{R}$ we
have $\sigma_{RELU}(c \cdot z) = c \cdot \sigma_{RELU}(z)$. This property is important as it allows us to scale
the incoming weights to a unit by $c > 0$ and scale the outgoing edges by $1/c$ without
changing the function computed by the network. For layered graphs, this means
we can scale $W_i$ by $c$ and compensate by scaling $W_{i+1}$ by $1/c$.
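The rescaling allowed by non-negative homogeneity can be checked numerically. The sketch below is an illustration with hypothetical weights and helper names (`forward`, `rescale`); it scales one layer by $c$ and the next by $1/c$ and confirms that the computed function is unchanged on a few inputs.

```python
def forward(weights, x):
    """Layered RELU network: RELU after every layer except the last."""
    h = list(x)
    for k, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if k < len(weights) - 1:
            h = [max(z, 0.0) for z in h]
    return h[0]


def rescale(weights, i, c):
    """Scale layer i by c > 0 and layer i + 1 by 1/c (0-based layer index)."""
    out = [[row[:] for row in W] for W in weights]
    out[i] = [[c * w for w in row] for row in out[i]]
    out[i + 1] = [[w / c for w in row] for row in out[i + 1]]
    return out


# Hypothetical depth-3 network with two units per hidden layer.
W = [[[1.0, -2.0], [0.5, 1.0]],   # W_1 (2 x 2)
     [[1.0, 1.0], [-1.0, 2.0]],   # W_2 (2 x 2)
     [[3.0, -1.0]]]               # W_3 (1 x 2)
W_scaled = rescale(W, 0, 4.0)

for x in ([1.0, 2.0], [-3.0, 0.5], [2.0, -1.0]):
    assert abs(forward(W, x) - forward(W_scaled, x)) < 1e-9
```

The same check would fail for an activation that is not positively homogeneous, such as the sigmoid.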

We will consider various measures $\alpha(w)$ of the magnitude of the weights $w(\cdot)$. Such
a measure induces a complexity measure on functions $f \in \mathcal{N}^{G,\sigma}$ defined by
$\alpha^{G,\sigma}(f) = \inf_{f_{G,w,\sigma} = f} \alpha(w)$. The sublevel sets of the complexity measure
$\alpha^{G,\sigma}$ form a family of hypothesis classes
$\mathcal{N}^{G,\sigma}_{\alpha \le a} = \{f \in \mathcal{N}^{G,\sigma} \mid \alpha^{G,\sigma}(f) \le a\}$.
Again we will use the shorthand $\alpha^{d,H,\sigma}$ and $\alpha^{d,\sigma}$ when referring to layered
graphs layer$(d, H)$ and layer$(d, \infty)$ respectively,
and frequently drop $\sigma$ when RELU is implicitly meant.

For a binary function $g : \{\pm 1\}^D \to \pm 1$ we say that $g$ is realized by $f$ with unit margin if
$\forall_x\, f(x) g(x) \ge 1$. A set of points $S$ is shattered with unit margin by a hypothesis class
$\mathcal{N}$ if all $g : S \to \pm 1$ can be realized with unit margin by some $f \in \mathcal{N}$.

3. Group Norm Regularization 

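Considering the grouping of weights going into each unit of the network, we will consider
the following generic group-norm type regularizer, parametrized by $1 \le p, q \le \infty$:

$$\mu_{p,q}(w) = \left(\sum_{v \in V}\left(\sum_{(u \to v) \in E} |w(u \to v)|^p\right)^{q/p}\right)^{1/q}. \qquad (3)$$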


Here and elsewhere we allow $q = \infty$ with the usual conventions that
$(\sum_i z_i^q)^{1/q} = \sup_i z_i$ and $1/q = 0$ when it appears in other contexts. When $q = \infty$
the group regularizer (3) imposes a per-unit regularization, where we constrain the norm of the
incoming weights of each unit separately, and when $q = p$ the regularizer (3) is an “overall”
weight regularizer, constraining the overall norm of all weights in the system. E.g., when
$q = p = 1$ we are paying for the sum of all magnitudes of weights in the network, and
$q = p = 2$ corresponds to overall weight-decay where we pay for the sum of square magnitudes
of all weights (i.e. the overall Euclidean norm of the weights).
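The group norm described above can be computed directly for a layered network. In this sketch the function name `group_norm` and the toy weights are assumptions for illustration; each matrix row is one unit's incoming weights, so each row is one group.

```python
import math


def group_norm(weights, p, q):
    """mu_{p,q}: l_p norm of each unit's incoming weights, then l_q over units.

    `weights` is a list of matrices (lists of rows); row j of a matrix holds
    the incoming weights of unit j, so every row is one group.
    Pass q = math.inf for per-unit regularization.
    """
    per_unit = [sum(abs(w) ** p for w in row) ** (1.0 / p)
                for W in weights for row in W]
    if math.isinf(q):
        return max(per_unit)
    return sum(n ** q for n in per_unit) ** (1.0 / q)


# Hypothetical two-layer network.
W = [[[3.0, 4.0], [0.0, 1.0]],   # two hidden units
     [[2.0, -2.0]]]              # output unit

print(group_norm(W, 2, math.inf))  # largest per-unit l_2 norm
print(group_norm(W, 1, 1))         # sum of all weight magnitudes
print(group_norm(W, 2, 2))         # overall Euclidean norm
```

The three calls correspond to the per-unit case ($q = \infty$) and the two overall cases ($q = p = 1$ and $q = p = 2$) mentioned in the text.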
For a layered graph, we have:

$$\mu_{p,q}(W) = \left(\sum_{k=1}^{d}\sum_{i=1}^{H}\left(\sum_{j=1}^{H} |W_k[i,j]|^p\right)^{q/p}\right)^{1/q}
= \left(\sum_{k=1}^{d} \|W_k\|_{p,q}^{q}\right)^{1/q}
\ge d^{1/q}\left(\prod_{k=1}^{d} \|W_k\|_{p,q}\right)^{1/d}
= d^{1/q}\sqrt[d]{\gamma_{p,q}(W)} \qquad (4)$$

where $\gamma_{p,q}(W) = \prod_{k=1}^{d} \|W_k\|_{p,q}$ aggregates the layers by multiplication instead of
summation. The inequality (4) holds regardless of the activation function, and so for any $\sigma$ we
have:


$$\gamma^{d,H,\sigma}_{p,q}(f) \le \frac{\left(\mu^{d,H,\sigma}_{p,q}(f)\right)^{d}}{d^{d/q}}. \qquad (5)$$


But due to the homogeneity of the RELU activation, when this activation is used we can
always balance the norm between the different layers without changing the computed
function so as to achieve equality in (4):

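Claim 1 For any $f \in \mathcal{N}^{d,H,\sigma_{RELU}}$,
$\mu^{d,H,\sigma_{RELU}}_{p,q}(f) = d^{1/q}\sqrt[d]{\gamma^{d,H,\sigma_{RELU}}_{p,q}(f)}$.

Proof Let $W$ be weights that realize $f$ and are optimal with respect to $\gamma_{p,q}$; i.e.
$\gamma_{p,q}(W) = \gamma_{p,q}(f)$. Let $\widetilde{W}_k = \sqrt[d]{\gamma_{p,q}(W)}\, W_k / \|W_k\|_{p,q}$, and observe
that they also realize $f$. We now have:

$$\mu_{p,q}(f) \le \mu_{p,q}(\widetilde{W}) = \left(\sum_{k=1}^{d} \|\widetilde{W}_k\|_{p,q}^{q}\right)^{1/q}
= \left(d\,\gamma_{p,q}(W)^{q/d}\right)^{1/q} = d^{1/q}\sqrt[d]{\gamma_{p,q}(f)},$$

which together with (4) completes the proof.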
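The balancing step in the proof can be tried out numerically. The sketch below, with hypothetical helper names and toy weights, rescales every layer to the same norm and checks that the additive measure then meets the lower bound in (4) while the network output is unchanged. It assumes finite $q$ and that no layer has zero norm.

```python
import math


def layer_norm(W, p, q):
    """||W||_{p,q}: l_p norm of each row, then l_q norm over rows (finite q)."""
    rows = [sum(abs(w) ** p for w in row) ** (1.0 / p) for row in W]
    return sum(r ** q for r in rows) ** (1.0 / q)


def mu(weights, p, q):
    """Additive aggregation over layers, as on the left of (4)."""
    return sum(layer_norm(W, p, q) ** q for W in weights) ** (1.0 / q)


def gamma(weights, p, q):
    """Multiplicative aggregation over layers."""
    return math.prod(layer_norm(W, p, q) for W in weights)


def balance(weights, p, q):
    """Rescale every layer to norm gamma^(1/d); the product is unchanged."""
    target = gamma(weights, p, q) ** (1.0 / len(weights))
    return [[[w * target / layer_norm(W, p, q) for w in row] for row in W]
            for W in weights]


def forward(weights, x):
    """Layered RELU network: RELU after every layer except the last."""
    h = list(x)
    for k, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if k < len(weights) - 1:
            h = [max(z, 0.0) for z in h]
    return h[0]


W = [[[3.0, 4.0], [0.0, 1.0]], [[2.0, -2.0]]]   # hypothetical weights
p = q = 2
d = len(W)
lower = d ** (1.0 / q) * gamma(W, p, q) ** (1.0 / d)
B = balance(W, p, q)

assert mu(W, p, q) >= lower                      # inequality (4)
assert abs(mu(B, p, q) - lower) < 1e-9           # equality after balancing
assert abs(forward(W, [1.0, 2.0]) - forward(B, [1.0, 2.0])) < 1e-9
```

The per-layer scale factors multiply to one, which is why homogeneity of the RELU leaves the output untouched.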


The two measures are therefore equivalent when we use RELUs, and define the same level
sets, or family of hypothesis classes, which we refer to simply as $\mathcal{N}^{d,H}_{p,q}$. In the
remainder of this Section, we investigate convexity and generalization properties of these
hypothesis classes.


3.1. Generalization and Capacity 

In order to understand the effect of the norm on the sample complexity, we bound the
Rademacher complexity of the classes $\mathcal{N}^{d,H}_{p,q}$. Recall that the Rademacher complexity is
a measure of the capacity of a hypothesis class on a specific sample, which can be used
to bound the difference between empirical and expected error, and thus the excess
generalization error of empirical risk minimization (see, e.g., Bartlett and Mendelson (2003) for
a complete treatment, and Appendix A for the exact definitions we use). In particular, the
Rademacher complexity typically scales as $\sqrt{C/m}$, which corresponds to a sample
complexity of $O(C/\epsilon^2)$, where $m$ is the sample size and $C$ is the effective measure of capacity
of the hypothesis class.


Neyshabur Tomioka Srebro 


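Theorem 1 For any $d, q \ge 1$, any $1 \le p < \infty$ and any set $S = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^D$:

$$\mathcal{R}_m\left(\mathcal{N}^{d,H,\sigma_{RELU}}_{\gamma_{p,q} \le \gamma}\right)
\le \gamma\left(2H^{[\frac{1}{p^*}-\frac{1}{q}]_+}\right)^{(d-1)} \mathcal{R}^{linear}_{m,p,D}
\le \sqrt{\frac{\gamma^2\left(2H^{[\frac{1}{p^*}-\frac{1}{q}]_+}\right)^{2(d-1)} \min\{p^*, 4\log(2D)\} \max_i \|x_i\|_{p^*}^2}{m}}$$

and so:

$$\mathcal{R}_m\left(\mathcal{N}^{d,H,\sigma_{RELU}}_{\mu_{p,q} \le \mu}\right)
\le \left(\frac{\mu}{\sqrt[q]{d}}\right)^{d}\left(2H^{[\frac{1}{p^*}-\frac{1}{q}]_+}\right)^{(d-1)} \mathcal{R}^{linear}_{m,p,D}
\le \sqrt{\frac{\left(\mu/\sqrt[q]{d}\right)^{2d}\left(2H^{[\frac{1}{p^*}-\frac{1}{q}]_+}\right)^{2(d-1)} \min\{p^*, 4\log(2D)\} \max_i \|x_i\|_{p^*}^2}{m}}$$

where the second inequalities hold only if $1 \le p \le 2$, $\mathcal{R}^{linear}_{m,p,D}$ is the Rademacher
complexity of $D$-dimensional linear predictors with unit $\ell_p$ norm with respect to a set of $m$
samples, and $p^*$ is such that $\frac{1}{p^*} + \frac{1}{p} = 1$.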


Proof sketch We prove the bound by induction, showing that for any $q, d > 1$ and
$1 \le p < \infty$,

$$\mathcal{R}_m\left(\mathcal{N}^{d,H}_{\gamma_{p,q} \le \gamma}\right) \le 2H^{[\frac{1}{p^*}-\frac{1}{q}]_+}\, \mathcal{R}_m\left(\mathcal{N}^{d-1,H}_{\gamma_{p,q} \le \gamma}\right).$$

The intuition is that when $p^* < q$, the Rademacher complexity increases by simply
distributing the weights among neurons, and if $p^* \ge q$ then the supremum is attained when
the output neuron is connected to a neuron with highest Rademacher complexity in the
lower layer and all other weights in the top layer are set to zero. For a complete proof, see
Appendix A. ■

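Note that for $2 \le p < \infty$, the bound on the Rademacher complexity scales as $1/m^{\frac{1}{p}}$
(see Section A.1 in the appendix) because:

$$\mathcal{R}^{linear}_{m,p,D} \le \frac{\sqrt{2}\,\|X\|_{2,p^*}}{m} \le \frac{\sqrt{2}\,\max_i \|x_i\|_{p^*}}{m^{\frac{1}{p}}}. \qquad (6)$$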


The bound in Theorem 1 depends on both the magnitude of the weights, as captured by
$\mu_{p,q}(W)$ or $\gamma_{p,q}(W)$, and also on the width $H$ of the network (the number of nodes in each
layer). However, the dependence on the width $H$ disappears, and the bound depends only
on the magnitude, as long as $q \le p^*$ (i.e. $1/p + 1/q \ge 1$). This happens, e.g., for overall
$\ell_1$ and $\ell_2$ regularization, for per-unit $\ell_1$ regularization, and whenever $1/p + 1/q = 1$. In such
cases, we can omit the size constraint and state the theorem for an infinite-width layered
network (i.e. a network with a countably infinite number of units, when the number of
units is allowed to be as large as needed):
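Corollary 2 For any $d \ge 1$, $1 \le p < \infty$ and $1 \le q \le p^* = p/(p-1)$, and any set
$S = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^D$,

$$\mathcal{R}_m\left(\mathcal{N}^{d,\sigma_{RELU}}_{\gamma_{p,q} \le \gamma}\right)
\le \gamma\, 2^{(d-1)}\, \mathcal{R}^{linear}_{m,p,D}
\le \sqrt{\frac{\gamma^2\, 2^{2(d-1)} \min\{p^*, 4\log(2D)\} \max_i \|x_i\|_{p^*}^2}{m}}$$

and so:

$$\mathcal{R}_m\left(\mathcal{N}^{d,\sigma_{RELU}}_{\mu_{p,q} \le \mu}\right)
\le \left(\frac{2\mu}{\sqrt[q]{d}}\right)^{d} \mathcal{R}^{linear}_{m,p,D}
\le \sqrt{\frac{\left(2\mu/\sqrt[q]{d}\right)^{2d} \min\{p^*, 4\log(2D)\} \max_i \|x_i\|_{p^*}^2}{m}}$$

where the second inequalities hold only if $1 \le p \le 2$ and $\mathcal{R}^{linear}_{m,p,D}$ is the Rademacher
complexity of $D$-dimensional linear predictors with unit $\ell_p$ norm with respect to a set of $m$
samples.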
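To see when the width matters, one can evaluate the explicit upper bound for the class with $\gamma_{p,q} \le \gamma$, namely $\sqrt{\gamma^2 (2H^{[1/p^* - 1/q]_+})^{2(d-1)} \min\{p^*, 4\log(2D)\} \max_i \|x_i\|_{p^*}^2 / m}$, for different values of $H$. The sketch below is only a calculator for that expression (valid for $1 \le p \le 2$); the function name and the sample parameter values are assumptions for illustration.

```python
import math


def rademacher_bound(gamma, d, H, p, q, D, m, max_x_norm):
    """Evaluate the explicit bound for the class gamma_{p,q} <= gamma.

    max_x_norm is max_i ||x_i|| measured in the dual norm l_{p*}.
    Intended for 1 <= p <= 2; use q = math.inf for per-unit regularization.
    """
    p_star = math.inf if p == 1 else p / (p - 1.0)
    inv_p_star = 1.0 - 1.0 / p
    inv_q = 0.0 if math.isinf(q) else 1.0 / q
    width_exp = max(inv_p_star - inv_q, 0.0)     # [1/p* - 1/q]_+
    factor = (2.0 * H ** width_exp) ** (d - 1)
    c = min(p_star, 4.0 * math.log(2 * D))
    return gamma * factor * max_x_norm * math.sqrt(c / m)


# Overall l_2 regularization (p = q = 2): the width H drops out.
narrow = rademacher_bound(1.0, 3, 10, 2, 2, 100, 1000, 1.0)
wide = rademacher_bound(1.0, 3, 1000, 2, 2, 100, 1000, 1.0)
assert narrow == wide

# Per-unit l_2 regularization (p = 2, q = inf): the bound grows with H.
assert (rademacher_bound(1.0, 3, 1000, 2, math.inf, 100, 1000, 1.0)
        > rademacher_bound(1.0, 3, 10, 2, math.inf, 100, 1000, 1.0))
```

The exponent `width_exp` is zero exactly when $q \le p^*$, which is the regime covered by Corollary 2.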


3.2. Tightness 

We next investigate the tightness of the complexity bound in Theorem 1, and show that
when $1/p + 1/q < 1$ the dependence on the width $H$ is indeed unavoidable. We show
not only that the bound on the Rademacher complexity is tight, but that the implied bound
on the sample complexity is tight, even for binary classification with a margin over binary
inputs. To do this, we show how we can shatter the $m = 2^D$ points $\{\pm 1\}^D$ using a network
with small group-norm:

Theorem 3 For any $p, q \ge 1$ (and $1/p^* + 1/p = 1$) and any depth $d \ge 2$, the $m = 2^D$
points $\{\pm 1\}^D$ can be shattered with unit margin by $\mathcal{N}^{d,H}_{\gamma_{p,q} \le \gamma}$ with:

$$\gamma \le D^{1/p}\, m^{1/p + 1/q}\, H^{-(d-2)[1/p^* - 1/q]_+}$$

Proof Consider a size $m$ subset $S_m$ of the $2^D$ vertices of the $D$-dimensional hypercube
$\{-1, +1\}^D$. We construct the first layer using $m$ units. Each unit has a unique weight
vector consisting of $+1$'s and $-1$'s and will output a positive value if and only if the sign
pattern of the input $x \in S_m$ matches that of the weight vector. The second layer has
a single unit and connects to all $m$ units in the first layer. For any $m$-dimensional sign
pattern $b \in \{-1, +1\}^m$, we can choose the weights of the second layer to be $b$, and the
network will output the desired sign for each $x \in S_m$ with unit margin. The norm of the
network is at most $(m \cdot D^{q/p})^{1/q} \cdot m^{1/p} = D^{1/p} \cdot m^{1/p + 1/q}$. This establishes the claim
for $d = 2$. For $d > 2$ and $1/p + 1/q \ge 1$, we obtain the same norm and unit margin
by adding $d - 2$ layers with one unit in each layer connected to the previous layer by a
unit weight. For $d > 2$ and $1/p + 1/q < 1$, we show the dependence on $H$ by recursively
replacing the top unit with $H$ copies of it and adding an averaging unit on top of
that. More specifically, given the above $d = 2$ layer network, we make $H$ copies of the
output unit with rectified linear activation and add a 3rd layer with one output unit with
uniform weight $1/H$ to all the copies in the 2nd layer. Since this operation does not change
the output of the network, we have the same margin and now the norm of the network is
$(m \cdot D^{q/p})^{1/q} \cdot (H m^{q/p})^{1/q} \cdot (H (1/H^p))^{1/p} = D^{1/p} \cdot m^{1/p + 1/q} \cdot H^{1/q - 1/p^*}$.
That is, we have reduced the norm by a factor of $H^{1/q - 1/p^*}$. By repeating this process, we get the
geometric reduction in the norm $H^{(d-2)(1/q - 1/p^*)}$, which concludes the proof. ■
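The depth-2 construction in the proof can be checked by brute force for small $D$. The sketch below makes one assumption that the proof sketch leaves implicit: each hidden unit gets a bias of $-(D-1)$ (for example through a constant input coordinate), so that it outputs 1 on its own sign pattern and 0 on every other vertex. The function name is hypothetical, and the sketch does not track the norm of the network.

```python
from itertools import product


def shatter_network(D, labels):
    """Two-layer RELU network realizing `labels` on {-1,+1}^D with unit margin.

    `labels` maps each sign pattern (a tuple) to +1 or -1. The bias -(D - 1)
    on every hidden unit is an assumption of this sketch.
    """
    patterns = list(product((-1, 1), repeat=D))

    def f(x):
        # Hidden unit s fires (with value 1) only when x equals s.
        hidden = [max(sum(s_i * x_i for s_i, x_i in zip(s, x)) - (D - 1), 0)
                  for s in patterns]
        # Output weights are the desired signs b.
        return sum(labels[s] * h for s, h in zip(patterns, hidden))

    return f


D = 3
points = list(product((-1, 1), repeat=D))
# Try every one of the 2^(2^D) = 256 labelings of the 8 points.
for signs in product((-1, 1), repeat=len(points)):
    g = dict(zip(points, signs))
    f = shatter_network(D, g)
    assert all(f(x) * g[x] >= 1 for x in points)
```

A mismatched pattern has inner product at most $D - 2$ with the input, so after the bias it is negative and the RELU zeroes it out.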


To understand this lower bound, first consider the bound without the dependence on
the width $H$. We have that for any depth $d \ge 2$, $\gamma \le m^r D = m^r \log m$ (since $1/p \le 1$
always) where $r = 1/p + 1/q \le 2$. This means that for any depth $d \ge 2$ and any $p, q$
the sample complexity of learning the class scales as $m = \Omega(\gamma^{1/r} / \log \gamma) \ge \tilde{\Omega}(\sqrt{\gamma})$. This
shows a polynomial dependence on $\gamma$, though with a lower exponent than the $\gamma^2$ (or higher
for $p > 2$) dependence in Theorem 1. Still, if we now consider the complexity control as a
function of $\mu_{p,q}$ we get a sample complexity of at least $\Omega(\mu^{d/2} / \log \mu)$, establishing that if
we control the group-norm as in (3), we cannot avoid a sample complexity which depends
exponentially on the depth. Note that in our construction, all other factors in Theorem 1,
namely $\max_i \|x_i\|$ and $\log D$, are logarithmic (or double-logarithmic) in $m$.

Next we consider the dependence on the width $H$ when $1/p + 1/q < 1$. Here we
have to use depth $d \ge 3$, and we see that indeed as the width $H$ and depth $d$ increase,
the magnitude control $\gamma$ can decrease as $H^{(1/q - 1/p^*)(d-2)}$ without decreasing the capacity,
matching Theorem 1 up to an offset of 2 on the depth. In particular, we see that in this
regime we can shatter an arbitrarily large number of points with arbitrarily low $\gamma$ by using
enough hidden units, and so the capacity of $\mathcal{N}^{d}_{p,q}$ is indeed infinite and it cannot ensure any
generalization.

3.3. Convexity 

Finally we establish a sufficient condition for the hypothesis classes $\mathcal{N}^{d}_{p,q}$ to be convex. We
are referring to convexity of the functions in $\mathcal{N}^{d}_{p,q}$ independent of a specific
representation. If we consider a, possibly regularized, empirical risk minimization problem on the
weights, the objective (the empirical risk) would never be a convex function of the weights
(for depth $d \ge 2$), even if the regularizer is convex in $w$ (which it always is for $p, q \ge 1$).
But if we do not bound the width of the network, and instead rely on magnitude-control
alone, we will see that the resulting hypothesis class, and indeed the complexity measure,
may be convex (with respect to taking convex combinations of functions, not of weights).

Theorem 4 For any $d, p, q \ge 1$ such that $\frac{1}{q} \le \frac{1}{d-1}\left(1 - \frac{1}{p}\right)$, $\gamma^{d}_{p,q}(f)$ is a
semi-norm in $\mathcal{N}^{d}$.

In particular, under the condition of the Theorem, $\gamma^{d}_{p,q}$ is convex, and hence its sublevel sets
$\mathcal{N}^{d}_{p,q}$ are convex, and so $\mu^{d}_{p,q}$ is quasi-convex (but not convex).
Proof sketch To show convexity, consider two functions $f, g \in \mathcal{N}^{d}_{\gamma_{p,q} \le \gamma}$ and $0 < \alpha < 1$,
and let $U$ and $V$ be the weights realizing $f$ and $g$ respectively with $\gamma_{p,q}(U) \le \gamma$ and
$\gamma_{p,q}(V) \le \gamma$. We will construct weights $W$ realizing $\alpha f + (1 - \alpha) g$ with $\gamma_{p,q}(W) \le \gamma$.
This is done by first balancing $U$ and $V$ s.t. at each layer $\|U_i\|_{p,q} = \sqrt[d]{\gamma_{p,q}(U)}$ and
$\|V_i\|_{p,q} = \sqrt[d]{\gamma_{p,q}(V)}$, and then placing $U$ and $V$ side by side, with no interaction between
the units calculating $f$ and $g$ until the output layer. The output unit has weights $\alpha U_d$ coming in
from the $f$-side and weights $(1 - \alpha) V_d$ coming in from the $g$-side. In Appendix B we show that
under the condition in the theorem, $\gamma_{p,q}(W) \le \gamma$. To complete the proof, we also show $\gamma^{d}_{p,q}$
is homogeneous and that this is sufficient for convexity. ■
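The side-by-side placement used in the proof sketch can be illustrated as follows. This minimal sketch, with hypothetical helper names and toy weights, builds the combined network for depth $d \ge 2$ and checks only that it computes $\alpha f + (1 - \alpha) g$; it omits the balancing step and does not verify the norm bound, which is the part deferred to Appendix B.

```python
def forward(weights, x):
    """Layered RELU network: RELU after every layer except the last."""
    h = list(x)
    for k, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if k < len(weights) - 1:
            h = [max(z, 0.0) for z in h]
    return h[0]


def side_by_side(U, V, alpha):
    """Combine two depth-d networks (d >= 2) into one computing the mixture."""
    d = len(U)
    W = [U[0] + V[0]]                      # both sides read the same input
    for k in range(1, d - 1):              # block-diagonal hidden layers
        hu, hv = len(U[k][0]), len(V[k][0])
        W.append([row + [0.0] * hv for row in U[k]] +
                 [[0.0] * hu + row for row in V[k]])
    W.append([[alpha * u for u in U[-1][0]] +
              [(1 - alpha) * v for v in V[-1][0]]])
    return W


# Hypothetical depth-3 networks of different widths.
U = [[[1.0, -1.0], [2.0, 0.0]], [[1.0, 1.0], [0.0, -1.0]], [[1.0, 2.0]]]
V = [[[0.0, 1.0]], [[2.0]], [[-3.0]]]
alpha = 0.25
W = side_by_side(U, V, alpha)

for x in ([1.0, 2.0], [-1.0, 0.5], [3.0, -2.0]):
    mix = alpha * forward(U, x) + (1 - alpha) * forward(V, x)
    assert abs(forward(W, x) - mix) < 1e-9
```

Because the two halves never interact before the output unit, the result holds for any widths of $U$ and $V$.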

4. Per-Unit and Path Regularization 

In this Section we will focus on the special case of $q = \infty$, i.e. when we constrain the norm
of the incoming weights of each unit separately.

Per-unit $\ell_1$-regularization was studied by Bartlett (1998); Koltchinskii and Panchenko
(2002); Bartlett and Mendelson (2003) who showed generalization guarantees. A two-layer
network of this form with RELU activation was also considered by Bach (2014), who
studied its approximation ability and suggested heuristics for learning it. Per-unit $\ell_2$
regularization in a two-layer network was considered by Cho and Saul (2009), who showed it is
equivalent to using a specific kernel. We now introduce Path regularization and discuss its
equivalence to Per-Unit regularization.

Path Regularization Consider a regularizer which looks at the sum over all paths from
input nodes to the output node, of the product of the weights along the path:

$$\phi_p(w) = \left(\sum_{v_{in}[i] \xrightarrow{e_1} v_1 \xrightarrow{e_2} v_2 \cdots \xrightarrow{e_k} v_{out}} \left|\prod_{i=1}^{k} w(e_i)\right|^{p}\right)^{1/p} \qquad (7)$$

where $p \ge 1$ controls the norm used to aggregate the paths. We can motivate this regularizer
as follows: if a node does not have any high-weight paths going out of it, we really don’t
care much about what comes into it, as it won’t have much effect on the output. The
path-regularizer thus looks at the aggregated influence of all the weights.
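For a layered network the path regularizer can be computed either by enumerating every input-to-output path or, much more cheaply, by a single forward sweep over the $p$-th powers of the absolute weights. The sketch below compares the two on hypothetical toy weights; the function names are assumptions for illustration.

```python
from itertools import product


def path_norm_bruteforce(weights, p):
    """phi_p(w): enumerate every input-to-output path of a layered network."""
    sizes = [len(weights[0][0])] + [len(W) for W in weights]
    total = 0.0
    for nodes in product(*(range(n) for n in sizes)):
        prod_w = 1.0
        for k, W in enumerate(weights):
            # Edge from unit nodes[k] (previous layer) to unit nodes[k + 1].
            prod_w *= W[nodes[k + 1]][nodes[k]]
        total += abs(prod_w) ** p
    return total ** (1.0 / p)


def path_norm_dp(weights, p):
    """Same quantity via one forward sweep over |w|^p (no enumeration)."""
    acc = [1.0] * len(weights[0][0])
    for W in weights:
        acc = [sum(abs(w) ** p * a for w, a in zip(row, acc)) for row in W]
    return acc[0] ** (1.0 / p)


W = [[[1.0, -2.0], [3.0, 0.0]],   # hidden layer (2 units, 2 inputs)
     [[2.0, 1.0]]]                # output unit

for p in (1, 2, 3):
    assert abs(path_norm_bruteforce(W, p) - path_norm_dp(W, p)) < 1e-9
print(path_norm_dp(W, 1))  # 9.0
```

The sweep costs one pass over the weights, whereas the number of paths grows exponentially with depth.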

Referring to the induced regularizer $\phi^{G}_p(f) = \min_{f_{G,w} = f} \phi_p(w)$ (with the usual
shorthands for layered graphs), we now observe that for layered graphs, path regularization and
per-unit regularization are equivalent:

Theorem 5 For $p \ge 1$, any $d$ and (finite or infinite) $H$, for any $f \in \mathcal{N}^{d,H}$:
$\phi^{d,H}_p(f) = \gamma^{d,H}_{p,\infty}(f)$.

It is important to emphasize that even for layered graphs, it is not the case that for all
weights $\phi_p(w) = \gamma_{p,\infty}(w)$. E.g., a high-magnitude edge going into a unit with no non-zero
outgoing edges will affect $\gamma_{p,\infty}(w)$ but not $\phi_p(w)$, as will having high-magnitude edges
on different layers in different paths. In a sense path regularization is a more careful
regularizer, less fooled by imbalance. Nevertheless, in the proof of Theorem 5 in Appendix
C.1, we show we can always balance the weights such that the two measures are equal.
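The gap between the two measures on a fixed set of weights is easy to reproduce. In this sketch (hypothetical names and weights), a hidden unit with large incoming weights but a zero outgoing weight inflates the per-unit measure while leaving the path measure untouched; zeroing that unit's incoming weights, which does not change the function, is one simple case where the two then coincide.

```python
def per_unit_gamma(weights, p):
    """gamma_{p,inf}(w): product over layers of the largest per-unit l_p norm."""
    result = 1.0
    for W in weights:
        result *= max(sum(abs(w) ** p for w in row) ** (1.0 / p) for row in W)
    return result


def path_norm(weights, p):
    """phi_p(w) for a layered network via a forward sweep over |w|^p."""
    acc = [1.0] * len(weights[0][0])
    for W in weights:
        acc = [sum(abs(w) ** p * a for w, a in zip(row, acc)) for row in W]
    return acc[0] ** (1.0 / p)


# The second hidden unit has huge incoming weights but no outgoing weight.
W = [[[1.0, 1.0], [100.0, 100.0]], [[1.0, 0.0]]]
print(per_unit_gamma(W, 1), path_norm(W, 1))   # 200.0 2.0

# Removing the dead unit's incoming weights leaves the function unchanged.
W_pruned = [[[1.0, 1.0], [0.0, 0.0]], [[1.0, 0.0]]]
assert per_unit_gamma(W_pruned, 1) == path_norm(W_pruned, 1) == 2.0
```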

The equivalence does not extend to non-layered graphs, since the lengths of different
paths might be different. Again, we can think of the path regularizer as a more refined regularizer
taking into account the local structure. However, if we consider all DAGs of depth at most $d$
(i.e. with paths of length at most $d$), the notions are again equivalent (see proof in Appendix
C.2):


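Theorem 6 For any $p \ge 1$ and any $d$: $\gamma^{d}_{p,\infty}(f) = \min_{G \in \mathrm{DAG}(d)} \phi^{G}_p(f)$.

In particular, for any graph $G$ of depth $d$, we have that $\phi^{G}_p(f) \ge \gamma^{d}_{p,\infty}(f)$. Combining
this observation with Corollary 2 allows us to immediately obtain a generalization bound
for path regularization on any, even non-layered, graph:

Corollary 7 For any graph $G$ of depth $d$ and any set $S = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^D$:

$$\mathcal{R}_m\left(\mathcal{N}^{G}_{\phi_1 \le \phi}\right) \le \sqrt{\frac{4^{d-1} \phi^2 \cdot 4\log(2D) \sup_i \|x_i\|_{\infty}^2}{m}}$$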

Note that in order to apply Corollary 2 and obtain a width-independent bound, we had to 
limit ourselves to p = 1. We further explore this issue next. 

Capacity As was previously noted, size-independent generalization bounds for bounded
depth networks with bounded per-unit $\ell_1$ norm have long been known (and make for a
popular homework problem). These correspond to a specialization of Corollary 2 for the
case $p = 1$, $q = \infty$. Furthermore, the kernel view of Cho and Saul (2009) allows obtaining
size-independent generalization bounds for two-layer networks with bounded per-unit $\ell_2$
norm (i.e. a single infinite hidden layer of all possible unit-norm units, and a bounded
$\ell_2$-norm output unit). However, the lower bound of Theorem 3 establishes that for any $p > 1$,
once we go beyond two layers, we cannot ensure generalization without also controlling
the size (or width) of the network.

Convexity An immediate consequence of Theorem 4 is that per-unit regularization,
if we do not constrain the network width, is convex for any $p \ge 1$. In fact, $\gamma^{d}_{p,\infty}$ is a
(semi)norm. However, as discussed above, for depth $d > 2$ this is meaningful only for
$p = 1$, as $\gamma^{d}_{p,\infty}$ collapses for $p > 1$.

Hardness Since the classes $\mathcal{N}^{d}_{1,\infty}$ are convex, we might hope that this makes
learning computationally easier. Indeed, one can consider functional-gradient or boosting-type
strategies for learning a predictor in the class (Lee et al., 1996). However, as Bach (2014)
points out, this is not so easy as it requires finding the best fit for a target with a RELU unit,
which is itself a hard problem. Indeed, applying results on hardness of learning intersections of
halfspaces, which can be represented with small per-unit norm using two-layer networks, we
can conclude that, subject to certain complexity assumptions, it is not possible to efficiently
PAC learn $\mathcal{N}^{d}_{1,\infty}$, even for depth $d = 2$, when $\gamma_{1,\infty}$ increases superlinearly:
Corollary 8 Subject to the strong random CSP assumptions in Daniely et al. (2014),
it is not possible to efficiently PAC learn (even improperly) functions $\{\pm 1\}^D \to \{\pm 1\}$
realizable with unit margin by $\mathcal{N}^{2}_{1,\infty}$ when $\gamma_{1,\infty} = \omega(D)$ (e.g. when $\gamma_{1,\infty} = D \log D$).
Moreover, subject to intractability of the $\tilde{O}(D^{1.5})$-unique shortest vector problem, for any
$\epsilon > 0$, it is not possible to efficiently PAC learn (even improperly) functions
$\{\pm 1\}^D \to \{\pm 1\}$ realizable with unit margin by $\mathcal{N}^{2}_{1,\infty}$ when $\gamma_{1,\infty} = D^{1+\epsilon}$.

This is a corollary of Theorem 22 in Appendix D. Either version of Corollary 8
precludes the possibility of learning in time polynomial in $\gamma_{1,\infty}$, though it still might be
possible to learn in poly$(D)$ time when $\gamma_{1,\infty}$ is sublinear.

Sharing We conclude this Section with an observation on the type of networks obtained 
by per-unit, or equivalently path, regularization. 

Theorem 9 For any $p \ge 1$ and $d > 1$ and any $f \in \mathcal{N}^{d}$, there exists a layered graph
$G(V, E)$ of depth $d$, such that $f \in \mathcal{N}^{G}$ and $\gamma^{G}_{p,\infty}(f) = \phi^{G}_p(f) = \gamma^{d}_{p,\infty}(f)$, and the
out-degree of every internal (non-input) node in $G$ is one. That is, the subgraph of $G$ induced
by the non-input vertices is a tree directed toward the output vertex.

What the Theorem tells us is that we can realize every function as a tree with optimal per- 
unit norm. If we think of learning with an infinite fully-connected layered network, we 
can always restrict ourselves to models in which the non-zero-weight edges form a tree. 
This means that when using per-unit regularization we have no incentive to “share” lower- 
level units—each unit will only have a single outgoing edge and will only be used by a 
single down-stream unit. This seems to defy much of the intuition and power of using deep 
networks, where we expect lower layers to represent generic features useful in many
higher-level features. In effect, we are not encouraging any transfer between learning different
aspects of the function (or between different tasks or classes, if we do have multiple output 
units). Per-unit regularization therefore misses out on much of the inductive bias that we 
might like to impose when using deep learning (namely, promoting sharing). 

Proof [of Theorem 9] For any $f_{G,w} \in \mathcal{N}^{\mathrm{DAG}(d)}$, we show how to construct such $\tilde{G}$ and $\tilde{w}$.
We first sort the vertices of $G$ based on topological ordering such that the out-degree of the
first vertex is zero. Let $G_0 = G$ and $w_0 = w$. At each step $i$, we first set $G_i = G_{i-1}$ and
$w_i = w_{i-1}$ and then pick the vertex $u$ that is the $i$th vertex in the topological ordering. If the
out-degree of $u$ is at most 1, we do nothing. Otherwise, for every edge $(u \to v)$ we create a copy
of vertex $u$ that we call $u_v$, add the edge $(u_v \to v)$ to $G_i$ and connect all incoming edges of $u$
with the same weights to every such $u_v$, and finally we delete the vertex $u$ from $G_i$ together
with all incoming and outgoing edges of $u$. It is easy to verify that $f_{G_i,w_i} = f_{G_{i-1},w_{i-1}}$.
After at most $|V|$ such steps, all internal nodes have out-degree one and hence the subgraph
induced by non-input vertices will be a tree. ■
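The node-duplication idea in the proof can be sketched on a small DAG. The code below is a variant, not the exact procedure above: instead of deleting a shared vertex and creating one copy per outgoing edge in topological order, it walks back from the output and splits off a private copy of a shared vertex each time it is reached. The edge-dictionary representation, function names, and example graph are assumptions for illustration.

```python
def evaluate(edges, inputs, x, out):
    """Output of a RELU DAG; `edges` maps (u, v) to the weight w(u -> v)."""
    memo = dict(zip(inputs, x))

    def o(v):
        if v not in memo:
            total = sum(w * o(u) for (u, t), w in edges.items() if t == v)
            memo[v] = total if v == out else max(total, 0.0)
        return memo[v]

    return o(out)


def to_tree(edges, inputs, out):
    """Duplicate internal nodes until each has out-degree at most one."""
    edges = dict(edges)
    queue = [out]                      # nodes whose parents still need a visit
    while queue:
        v = queue.pop()
        for u in [a for (a, t) in list(edges) if t == v]:
            if u in inputs:
                continue               # input nodes may stay shared
            outgoing = [t for (a, t) in edges if a == u]
            if len(outgoing) > 1:
                # Split off a private copy of u that feeds only v.
                copy = (u, v)
                edges[(copy, v)] = edges.pop((u, v))
                for (a, t), w in list(edges.items()):
                    if t == u:
                        edges[(a, copy)] = w
                u = copy
            queue.append(u)
    return edges


inputs = ("x1", "x2")
edges = {("x1", "h"): 1.0, ("x2", "h"): -1.0,      # h is shared by a and b
         ("h", "a"): 2.0, ("h", "b"): -1.0, ("x1", "b"): 1.0,
         ("a", "out"): 1.0, ("b", "out"): 3.0}
tree = to_tree(edges, inputs, "out")

x = (2.0, 0.5)
assert evaluate(tree, inputs, x, "out") == evaluate(edges, inputs, x, "out")
```

Each copy receives the same incoming weights as the original, so its per-unit norm is unchanged; only the sharing is removed.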




5. Overall Regularization 

In this Section, we will focus on “overall” $\ell_p$ regularization, corresponding to the choice
$q = p$, i.e. when we bound the overall (vectorized) norm of all weights in the system:

$$\mu_{p,p}(w) = \left(\sum_{e \in E} |w(e)|^p\right)^{1/p}.$$

Capacity For $p \le 2$, Corollary 2 provides a generalization guarantee that is independent
of the width: we can conclude that if we use weight decay (overall $\ell_2$ regularization), or
any tighter $\ell_p$ regularization, there is no need to limit ourselves to networks of finite size
(as long as the corresponding dual-norm of the inputs is bounded). However, in Section
3.2 we saw that with $d \ge 3$ layers, the regularizer degenerates and leads to infinite capacity
classes if $p > 2$. In any case, even if we bound the overall $\ell_1$ norm, the complexity increases
exponentially with the depth.

Convexity The conditions of Theorem 4 for convexity of $\mathcal{N}^{d}_{p,p}$ are ensured when $p \ge d$.
For depth $d = 1$, i.e. a single unit, this just confirms that $\ell_p$-regularized linear prediction
is convex for $p \ge 1$. For depth $d = 2$, we get convexity with $\ell_2$ regularization, but not
$\ell_1$. For depth $d > 2$ we would need $p \ge d \ge 3$, however for such values of $p$ we know
from Theorem 3 that $\mathcal{N}^{d}_{p,p}$ degenerates to an infinite capacity class if we do not control the
width (if we do control the width, we do not get convexity). This leaves us with $\mathcal{N}^{2}_{2,2}$ as
the interesting convex class. Below we show an explicit convex characterization of $\mathcal{N}^{2}_{2,2}$ by
showing it is equivalent to so-called “convex neural nets”.

Convex Neural Nets (Bengio et al., 2005) over inputs in $\mathbb{R}^D$ are two-layer networks
with a fixed infinite hidden layer consisting of all units with weights $w \in \mathcal{G}$ for some base
class $\mathcal{G} \subseteq \mathbb{R}^D$, and a second $\ell_1$-regularized layer. Since over finite data the weights in the
second layer can always be taken to have finite support (i.e. be non-zero for only a finite
number of first-layer units), and we can approach any function with countable support, we
can instead think of a network in $\mathcal{N}^2$ where the bottom layer is constrained to $\mathcal{G}$ and the top
layer is $\ell_1$ regularized. Focusing on $\mathcal{G} = \{w \mid \|w\|_p \le 1\}$, this corresponds to imposing
an $\ell_p$ constraint on the bottom layer, and $\ell_1$ regularization on the top layer and yields the
following complexity measure over $\mathcal{N}^2$:


$$\nu_p(f) = \inf_{f_{\mathrm{layer}(d),W} = f \ \text{s.t.}\ \forall j\, \|W_1[j,:]\|_p \le 1} \|W_2\|_1 \qquad (8)$$

This is similar to per-unit regularization, except we impose different norms at different
layers (if $p \ne 1$). We can see that $\mathcal{N}^2_{\nu_p \le \nu} = \nu \cdot \mathrm{conv}(\sigma(\mathcal{G}))$, and is thus convex for any $p$.
Focusing on RELU activation we have the equivalence:

Theorem 10 $\mu^2_{2,2}(f) = 2\nu_2(f)$.

That is, overall $\ell_2$ regularization with two layers is equivalent to a convex neural net with
$\ell_2$-constrained units on the bottom layer and $\ell_1$ (not $\ell_2$!) regularization on the output.





Norm-Based Capacity Control in Neural Networks 


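Proof We can calculate:

$$\min_{f_W = f} \mu^2_{2,2}(W) = \min_{f_W = f} \sum_{j=1}^{H} \left( \sum_{i=1}^{D} |W_1[j,i]|^2 + |W_2[j]|^2 \right)$$

$$= 2 \min_{f_W = f} \sum_{j=1}^{H} \sqrt{\sum_{i=1}^{D} |W_1[j,i]|^2} \cdot |W_2[j]| \qquad (9)$$

$$= 2 \min_{f_W = f} \sum_{j=1}^{H} |W_2[j]| \quad \text{s.t.} \quad \sqrt{\sum_{i=1}^{D} |W_1[j,i]|^2} \le 1. \qquad (10)$$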

Here (9) is the arithmetic-geometric mean inequality for which we can achieve equality by 
balancing the weights (as in Claim 1) and (10) again follows from the homogeneity of the 
RELU which allows us to rebalance the weights. ■ 
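
The balancing step behind (9) can be checked numerically. The following is a minimal sketch, assuming a two-layer RELU network stored as plain Python lists; the helper names (`forward`, `balance`, `mu22_squared`, `two_nu2`) are illustrative and not from the paper. It rescales each hidden unit so that the norm of its incoming weights equals the magnitude of its outgoing weight, which leaves the function unchanged and turns the arithmetic-geometric mean inequality into an equality.

```python
import math
import random

def relu(z):
    return max(z, 0.0)

def forward(W1, W2, x):
    # Two-layer RELU network: f(x) = sum_j W2[j] * relu(<W1[j], x>)
    return sum(w2 * relu(sum(a * b for a, b in zip(row, x)))
               for row, w2 in zip(W1, W2))

def mu22_squared(W1, W2):
    # Sum of squares of all weights (the quantity minimized in the proof)
    return sum(a * a for row in W1 for a in row) + sum(w * w for w in W2)

def two_nu2(W1, W2):
    # 2 * sum_j ||W1[j]||_2 * |W2[j]|, the right-hand side of (9)
    return 2.0 * sum(math.sqrt(sum(a * a for a in row)) * abs(w2)
                     for row, w2 in zip(W1, W2))

def balance(W1, W2):
    # Scale unit j by c_j below and 1/c_j above so that ||W1[j]||_2 == |W2[j]|
    B1, B2 = [], []
    for row, w2 in zip(W1, W2):
        n = math.sqrt(sum(a * a for a in row))
        if n == 0.0 or w2 == 0.0:
            # A dead unit contributes nothing to the function; zero it out
            B1.append([0.0] * len(row))
            B2.append(0.0)
            continue
        c = math.sqrt(abs(w2) / n)
        B1.append([c * a for a in row])
        B2.append(w2 / c)
    return B1, B2

if __name__ == "__main__":
    random.seed(0)
    W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
    W2 = [random.uniform(-1, 1) for _ in range(4)]
    B1, B2 = balance(W1, W2)
    print(mu22_squared(W1, W2), mu22_squared(B1, B2), two_nu2(W1, W2))
```

After balancing, the sum of squared weights coincides with twice the sum of products, and it never exceeds the unbalanced value.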


Hardness As with $\mathcal{N}^d_{1,\infty}$, we might hope that the convexity of $\mathcal{N}^2_{2,2}$ might make it
computationally easy to learn. However, by the same reduction from learning intersection of
halfspaces (Theorem 22 in Appendix D) we can again conclude that we cannot learn in
time polynomial in $\mu^2_{2,2}$:

Corollary 11 Subject to the strong random CSP assumptions in Daniely et al. (2014),
it is not possible to efficiently PAC learn (even improperly) functions $\{\pm 1\}^D \to \{\pm 1\}$
realizable with unit margin by $\mathcal{N}^2_{p,p}$ when $\mu_{p,p} = \omega(D^{1/p})$ (e.g. when $\gamma_{1,\infty} = D \log D$).
Moreover, subject to intractability of the $\tilde{O}(D^{1.5})$-unique shortest vector problem, for any
$\epsilon > 0$, it is not possible to efficiently PAC learn (even improperly) functions $\{\pm 1\}^D \to \{\pm 1\}$
realizable with unit margin by $\mathcal{N}^2_{1,\infty}$ when $\gamma_{1,\infty} = D^{1/p + \epsilon}$.

6. Depth Independent Regularization 

Up until now we discussed relying on magnitude-based regularization instead of directly 
controlling network size, thus allowing unbounded and even infinite width. But we still 
relied on a finite bound on the depth in all our derivations. Can the explicit dependence on 
the depth be avoided, and replaced with only a measure of scale of the weights? 

We already know we cannot rely only on a bound on the group-norm $\mu_{p,q}$ when the
depth is unbounded, as we know from Theorem 3 that in terms of $\mu_{p,q}$ the sample complexity
necessarily increases exponentially with the depth: if we allow arbitrarily deep graphs
we can shrink $\mu_{p,q}$ toward zero without changing the scale of the computed function. However,
controlling the $\gamma$-measure, or equivalently the path-regularizer $\phi$, in arbitrarily-deep
graphs is sensible, and we can define:

$$\gamma_{p,q}(f) = \inf_{d \ge 1} \gamma^d_{p,q}(f) = \lim_{d \to \infty} \gamma^d_{p,q}(f) \quad \text{or:} \quad \phi_p(f) = \inf_{G} \phi^G_p(f) \qquad (11)$$

where the minimization is over any DAG. From Theorem 6 we can conclude that $\phi_p(f) =
\gamma_{p,\infty}(f)$. In any case, $\gamma_{p,q}(f)$ is a sensible complexity measure, that does not collapse
despite the unbounded depth. Can we obtain generalization guarantees for the class
$\mathcal{N}_{\gamma_{p,q} \le \gamma}$?

Unfortunately, even when $1/p + 1/q \ge 1$ and we can obtain width-independent bounds,
the bound in Corollary 2 still has a dependence on $4^d$, even if $\gamma_{p,q}$ is bounded. Can such a
dependence be avoided?

For anti-symmetric Lipschitz-continuous activation functions (i.e. such that $\sigma(-z) =
-\sigma(z)$), such as the ramp, and for per-unit $\ell_1$-regularization $\mu^d_{1,\infty}$ we can avoid the factor
of $4^d$:

Theorem 12 For any anti-symmetric 1-Lipschitz function $\sigma$ and any set $S = \{x_1, \ldots, x_m\} \subseteq
\mathbb{R}^D$:

$$\mathcal{R}_m\left(\mathcal{N}^{d,\sigma}_{\mu_{1,\infty} \le \mu}\right) \le \sqrt{\frac{4 \mu^{2d} \log(2D) \sup_i \|x_i\|_\infty^2}{m}}$$

The proof is again based on an inductive argument similar to that of Theorem 1 and can be
found in Appendix A.4.

However, the ramp is not homogeneous and so the equivalence between $\mu$, $\gamma$ and $\phi$ breaks
down. Can we obtain such a bound also for the RELU? At the very least, what we can say
is that an inductive argument such as that used in the proofs of Theorems 1 and 12 cannot be
used to avoid an exponential dependence on the depth. To see this, consider $\gamma_{1,\infty} \le 1$ (this
choice is arbitrary if we are considering the Rademacher complexity), for which we have

$$\mathcal{N}^{d+1}_{\gamma_{1,\infty} \le 1} = \left[ \mathrm{conv}\left( \mathcal{N}^{d}_{\gamma_{1,\infty} \le 1} \right) \right]_+ \qquad (12)$$

where $\mathrm{conv}(\cdot)$ is the symmetric convex hull, and $[z]_+ = \max(z, 0)$ is applied to each function
in the class. In order to apply the inductive argument without increasing the complexity
exponentially with the depth, we would need the operation $[\mathrm{conv}(\mathcal{H})]_+$ to preserve the
Rademacher complexity, at least for non-negative convex cones $\mathcal{H}$. However we show a
simple example of a non-negative convex cone $\mathcal{H}$ for which $\mathcal{R}_m([\mathrm{conv}(\mathcal{H})]_+) > \mathcal{R}_m(\mathcal{H})$.

We will specify $\mathcal{H}$ as a set of vectors in $\mathbb{R}^m$, corresponding to the evaluations $h(x_i)$
of different functions in the class on the $m$ points $x_i$ in the sample. In our construction,
we will have only $m = 3$ points. Consider $\mathcal{H} = \mathrm{conv}(\{(1, 0, 1), (0, 1, 1)\})$, in which case
$\mathcal{H}' = [\mathrm{conv}(\mathcal{H})]_+ = \mathrm{conv}(\{(1, 0, 1), (0, 1, 1), (0.5, 0, 0)\})$. It is not hard to verify that
$\mathcal{R}_m(\mathcal{H}') > \mathcal{R}_m(\mathcal{H})$.
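
The strict inequality in this three-point example can be verified by exhaustive enumeration. The sketch below assumes the sample Rademacher complexity of Appendix A, that is, with the $1/m$ normalization and the absolute value inside the supremum; the function name `rademacher` is illustrative. Since a linear functional attains its largest absolute value over a convex hull at a vertex, it suffices to enumerate the generating vectors.

```python
from fractions import Fraction
from itertools import product

def rademacher(vertices):
    # Exact sample Rademacher complexity (1/m normalization, absolute value)
    # of the convex hull of `vertices`, by enumerating all 2^m sign vectors.
    m = len(vertices[0])
    total = Fraction(0)
    for xi in product((-1, 1), repeat=m):
        total += max(abs(sum(s * v for s, v in zip(xi, vec))) for vec in vertices)
    return total / (2 ** m) / m

H = [(Fraction(1), Fraction(0), Fraction(1)),
     (Fraction(0), Fraction(1), Fraction(1))]
H_prime = H + [(Fraction(1, 2), Fraction(0), Fraction(0))]

if __name__ == "__main__":
    print(rademacher(H), rademacher(H_prime))
```

Under this normalization the enumeration gives $12/24$ for $\mathcal{H}$ and $13/24$ for $\mathcal{H}'$; a different normalization rescales both values but not their ordering.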


7. Summary and Open Issues 

We presented a general framework for norm-based capacity control for feed-forward networks,
and analyzed when the norm-based control is sufficient and to what extent capacity
still depends on other parameters. In particular, we showed that in depth $d > 2$ networks,
per-unit control with $p > 1$ and overall regularization with $p > 2$ is not sufficient for capacity
control without also controlling the network size. This is in contrast with linear models,
where with any $p < \infty$ we have only a weak dependence on dimensionality, and two-layer
networks where per-unit $p = 2$ is also sufficient for capacity control. We also obtained
generalization guarantees for perhaps the most natural form of regularization, namely $\ell_2$
regularization, and showed that even with such control we still necessarily have an exponential
dependence on the depth.

Although the additive $\mu$-measure and multiplicative $\gamma$-measure are equivalent at the
optimum, they behave rather differently in terms of optimization dynamics (based on anecdotal
empirical experience). Understanding the relationship between them, as well as the
novel path-based regularizer, can be helpful in practical regularization of neural networks.

Although we obtained a tight characterization of when size-independent capacity control
is possible, the precise polynomial dependence of margin-based classification (and
other tasks) on the norm might not be tight and can likely be improved, though this
would require going beyond bounding the Rademacher complexity of the real-valued class.
In particular, Theorem 1 gives the same bound for per-unit $\ell_1$ regularization and overall $\ell_1$
regularization, although we would expect the latter to have lower capacity.

Beyond the open issue regarding depth-independent $\gamma$-based capacity control, another
interesting open question is understanding the expressive power of $\mathcal{N}^d_{\gamma_{p,q} \le \gamma}$, particularly
as a function of the depth $d$. Clearly going from depth $d = 1$ to depth $d = 2$ provides
additional expressive power, but it is not clear how much additional depth helps. The class
$\mathcal{N}^2$ already includes all binary functions over $\{\pm 1\}^D$ and is dense among continuous
real-valued functions. But can the $\gamma$-measure be reduced by increasing the depth? Viewed
differently: $\gamma^d_{p,q}(f)$ is monotonically non-increasing in $d$, but are there functions for which it
continues decreasing? Although it seems obvious there are functions that require high
depth for efficient representation, these questions are related to decades-old problems in
circuit complexity and might not be easy to resolve.

Appendix A. Rademacher Complexities

The sample-based Rademacher complexity of a class $\mathcal{F}$ of functions mapping from $\mathcal{X}$ to $\mathbb{R}$
with respect to a set $S = \{x_1, \ldots, x_m\}$ is defined as:

$$\mathcal{R}_m(\mathcal{F}) = \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{m} \xi_i f(x_i) \right| \right]$$

In this section, we prove an upper bound for the Rademacher complexity of the class
$\mathcal{N}^{d,H,\sigma_{\mathrm{RELU}}}_{\gamma_{p,q} \le \gamma}$, i.e., the class of functions that can be represented as a depth $d$, width $H$ network
with rectified linear activations, and the layer-wise group norm complexity $\gamma_{p,q}$ bounded by
$\gamma$. As mentioned in the main text, our proof is an induction with respect to the depth $d$. We
start with $d = 1$ layer neural networks, which is essentially the class of linear separators.


A.1. $\ell_p$-regularized Linear Predictors

For completeness, we prove the upper bounds on the Rademacher complexity of the class of
linear separators with bounded $\ell_p$ norm. The upper bounds presented here are particularly
similar to generalization bounds in Kakade et al. (2009) and Balcan and Berlind (2014).
We first mention two already established lemmas that we use in the proofs.

Theorem 13 (Khintchine-Kahane Inequality) For any $0 < p < \infty$ and $S = \{z_1, \ldots, z_m\}$,
if the random variable $\xi$ is uniform over $\{\pm 1\}^m$, then

$$\left( \mathbb{E}_\xi \left[ \left| \sum_{i=1}^{m} \xi_i z_i \right|^p \right] \right)^{1/p} \le C_p \left( \sum_{i=1}^{m} |z_i|^2 \right)^{1/2}$$

where $C_p$ is a constant depending only on $p$.

The sharp value of the constant $C_p$ was found by Haagerup (1981) but for our analysis, it
is enough to note that if $p \ge 1$ we have $C_p \le \sqrt{p}$.

Lemma 14 (Massart Lemma) Let $A$ be a finite set of $m$ dimensional vectors. Then

$$\mathbb{E}_\xi \left[ \max_{a \in A} \frac{1}{m} \sum_{i=1}^{m} \xi_i a_i \right] \le \max_{a \in A} \|a\|_2 \frac{\sqrt{2 \log |A|}}{m}$$

where $|A|$ is the cardinality of $A$.

We are now ready to show upper bounds on Rademacher complexity of linear separators
with bounded $\ell_p$ norm.


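Lemma 15 (Rademacher complexity of linear separators with bounded $\ell_p$ norm) For any
$d, q \ge 1$, for any $1 \le p \le 2$,

$$\mathcal{R}_m\left(\mathcal{N}^1_{\gamma_{p,q} \le \gamma}\right) \le \sqrt{\frac{\gamma^2 \min\{p^*, 4\log(2D)\} \max_i \|x_i\|_{p^*}^2}{m}}$$

and for any $2 < p < \infty$

$$\mathcal{R}_m\left(\mathcal{N}^1_{\gamma_{p,q} \le \gamma}\right) \le \frac{\sqrt{2}\,\gamma \|X\|_{2,p^*}}{m} \le \frac{\sqrt{2}\,\gamma \max_i \|x_i\|_{p^*}}{m^{1/p}}$$

where $p^*$ is such that $\frac{1}{p^*} + \frac{1}{p} = 1$.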


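Proof First, note that $\mathcal{N}^1$ is the class of linear functions and hence for any function $f_w \in
\mathcal{N}^1$, we have that $\gamma_{p,q}(w) = \|w\|_p$. Therefore, we can write the Rademacher complexity
for a set $S = \{x_1, \ldots, x_m\}$ as:

$$\mathcal{R}_m\left(\mathcal{N}^1_{\gamma_{p,q} \le \gamma}\right) = \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{\|w\|_p \le \gamma} \left| \sum_{i=1}^{m} \xi_i w^\top x_i \right| \right]$$

$$= \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{\|w\|_p \le \gamma} \left| w^\top \sum_{i=1}^{m} \xi_i x_i \right| \right]$$

$$= \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{p^*} \right]$$

For $1 \le p \le \min\left\{2, \frac{2\log(2D)}{2\log(2D) - 1}\right\}$ (and therefore $2\log(2D) \le p^*$), we have

$$\mathcal{R}_m\left(\mathcal{N}^1_{\gamma_{p,q} \le \gamma}\right) = \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{p^*} \right]$$

$$\le D^{\frac{1}{p^*}} \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{\infty} \right]$$

$$\le D^{\frac{1}{2\log(2D)}} \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{\infty} \right]$$

$$\le \sqrt{2}\, \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{\infty} \right]$$
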
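We now use the Massart Lemma, viewing each feature $(x_i[j])_{i=1}^{m}$ for $j = 1, \ldots, D$ as a
member of a finite hypothesis class, and obtain

$$\sqrt{2}\, \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{\infty} \right] \le 2\gamma \frac{\sqrt{\log(2D)}}{m} \max_{j = 1, \ldots, D} \left\| (x_i[j])_{i=1}^{m} \right\|_2$$

$$\le 2\gamma \sqrt{\frac{\log(2D)}{m}} \max_i \|x_i\|_{\infty}$$

$$\le 2\gamma \sqrt{\frac{\log(2D)}{m}} \max_i \|x_i\|_{p^*}$$

If $\min\left\{2, \frac{2\log(2D)}{2\log(2D) - 1}\right\} < p < \infty$, by the Khintchine-Kahane inequality we have

$$\mathcal{R}_m\left(\mathcal{N}^1_{\gamma_{p,q} \le \gamma}\right) = \gamma\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \left\| \sum_{i=1}^{m} \xi_i x_i \right\|_{p^*} \right]$$

$$\le \frac{\gamma}{m} \left( \sum_{j=1}^{D} \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \left| \sum_{i=1}^{m} \xi_i x_i[j] \right|^{p^*} \right] \right)^{1/p^*}$$

$$\le \frac{\gamma \sqrt{p^*}}{m} \left( \sum_{j=1}^{D} \left\| (x_i[j])_{i=1}^{m} \right\|_2^{p^*} \right)^{1/p^*} = \frac{\gamma \sqrt{p^*}}{m} \|X\|_{2,p^*}$$

If $p^* \ge 2$, by the Minkowski inequality we have that $\|X\|_{2,p^*} \le m^{1/2} \max_i \|x_i\|_{p^*}$. Otherwise,
by subadditivity of the function $f(z) = z^{p^*/2}$, we get $\|X\|_{2,p^*} \le m^{1/p^*} \max_i \|x_i\|_{p^*}$. ■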
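
The first step of this proof reduces the Rademacher complexity of the linear class to $\frac{\gamma}{m}\mathbb{E}_\xi \|\sum_i \xi_i x_i\|_{p^*}$, which can be computed exactly for small samples. The sketch below does this by enumeration and compares the result with the first bound of Lemma 15 for $p = 1$ ($p^* = \infty$) and $p = 2$ ($p^* = 2$). The function names are illustrative, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math
from itertools import product

def dual_norm(v, p_star):
    # l_{p*} norm of a vector, with p* = infinity handled separately
    if math.isinf(p_star):
        return max(abs(a) for a in v)
    return sum(abs(a) ** p_star for a in v) ** (1.0 / p_star)

def exact_rademacher_linear(X, gamma, p_star):
    # gamma/m * E_xi || sum_i xi_i x_i ||_{p*}, enumerating all 2^m sign vectors
    m, D = len(X), len(X[0])
    total = 0.0
    for xi in product((-1, 1), repeat=m):
        s = [sum(xi[i] * X[i][j] for i in range(m)) for j in range(D)]
        total += dual_norm(s, p_star)
    return gamma * total / (2 ** m) / m

def lemma15_bound(X, gamma, p_star):
    # sqrt(gamma^2 * min{p*, 4 log(2D)} * max_i ||x_i||_{p*}^2 / m), for 1 <= p <= 2
    m, D = len(X), len(X[0])
    r = max(dual_norm(x, p_star) for x in X)
    return math.sqrt(gamma ** 2 * min(p_star, 4 * math.log(2 * D)) * r ** 2 / m)

if __name__ == "__main__":
    X = [[1.0, -0.5, 0.2], [0.3, 0.8, -1.0], [-0.7, 0.1, 0.4], [0.9, -0.9, 0.6]]
    for p_star in (math.inf, 2.0):
        print(p_star, exact_rademacher_linear(X, 1.5, p_star), lemma15_bound(X, 1.5, p_star))
```

The enumeration costs $2^m$ evaluations, so it is only meant as a sanity check on toy samples.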
A.2. Theorem 1

We define the hypothesis class $\mathcal{N}^{d,H,H}$ to be the class of functions from $\mathcal{X}$ to $\mathbb{R}^H$ computed
by a layered network of depth $d$, layer size $H$ and $H$ outputs.

For the proof of Theorem 1, we need the following two technical lemmas. The first is
the well-known contraction lemma:


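Lemma 16 (Contraction Lemma) Let function $\phi : \mathbb{R} \to \mathbb{R}$ be Lipschitz with constant $L_\phi$
such that $\phi$ satisfies $\phi(0) = 0$. Then for any class $\mathcal{F}$ of functions mapping from $\mathcal{X}$ to $\mathbb{R}$ and
any set $S = \{x_1, \ldots, x_m\}$:

$$\mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{m} \xi_i \phi(f(x_i)) \right| \right] \le 2 L_\phi\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^{m} \xi_i f(x_i) \right| \right]$$

Next, the following lemma reduces the maximization over a matrix $W \in \mathbb{R}^{H \times H}$ that appears
in the computation of Rademacher complexity to $H$ independent maximizations over
a vector $w \in \mathbb{R}^H$ (the proof is deferred to Subsection A.3):

Lemma 17 For any $p, q \ge 1$, $d \ge 2$, $\xi \in \{\pm 1\}^m$ and $f \in \mathcal{N}^{d,H,H}$ we have

$$\sup_W \frac{1}{\|W\|_{p,q}} \left\| \sum_{i=1}^{m} \xi_i \left[ W [f(x_i)]_+ \right]_+ \right\|_{p^*} = H^{\left[\frac{1}{p^*} - \frac{1}{q}\right]_+} \sup_w \frac{1}{\|w\|_p} \left| \sum_{i=1}^{m} \xi_i \left[ w^\top [f(x_i)]_+ \right]_+ \right|$$

where $p^*$ is such that $\frac{1}{p^*} + \frac{1}{p} = 1$.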

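Theorem 1 For any $d, p, q \ge 1$ and any set $S = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^D$:

$$\mathcal{R}_m\left(\mathcal{N}^{d,H,\sigma_{\mathrm{RELU}}}_{\gamma_{p,q} \le \gamma}\right) \le \sqrt{\frac{\gamma^2 \left( 2 H^{\left[\frac{1}{p^*} - \frac{1}{q}\right]_+} \right)^{2(d-1)} \min\{p^*, 2\log(2D)\} \sup_i \|x_i\|_{p^*}^2}{m}}$$

and so:

$$\mathcal{R}_m\left(\mathcal{N}^{d,H,\sigma_{\mathrm{RELU}}}_{\mu_{p,q} \le \mu}\right) \le \sqrt{\frac{\left( \mu / \sqrt[q]{d} \right)^{2d} \left( 2 H^{\left[\frac{1}{p^*} - \frac{1}{q}\right]_+} \right)^{2(d-1)} \min\{p^*, 2\log(2D)\} \sup_i \|x_i\|_{p^*}^2}{m}}$$

where $p^*$ is such that $\frac{1}{p^*} + \frac{1}{p} = 1$.

Proof By the definition of Rademacher complexity, if $\xi$ is uniform over $\{\pm 1\}^m$, we have:

$$\mathcal{R}_m\left(\mathcal{N}^{d,H}_{\gamma_{p,q} \le \gamma}\right) = \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{f \in \mathcal{N}^{d,H}_{\gamma_{p,q} \le \gamma}} \left| \sum_{i=1}^{m} \xi_i f(x_i) \right| \right]$$

$$= \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{f \in \mathcal{N}^{d,H}} \frac{\gamma}{\gamma_{p,q}(f)} \left| \sum_{i=1}^{m} \xi_i f(x_i) \right| \right]$$

$$= \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{g \in \mathcal{N}^{d-1,H,H}} \sup_w \frac{\gamma}{\gamma_{p,q}(g) \|w\|_p} \left| \sum_{i=1}^{m} \xi_i w^\top [g(x_i)]_+ \right| \right]$$

$$= \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{g \in \mathcal{N}^{d-1,H,H}} \frac{\gamma}{\gamma_{p,q}(g)} \left\| \sum_{i=1}^{m} \xi_i [g(x_i)]_+ \right\|_{p^*} \right]$$

$$= \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{h \in \mathcal{N}^{d-2,H,H}} \frac{\gamma}{\gamma_{p,q}(h)} \sup_W \frac{1}{\|W\|_{p,q}} \left\| \sum_{i=1}^{m} \xi_i \left[ W [h(x_i)]_+ \right]_+ \right\|_{p^*} \right]$$

$$= H^{\left[\frac{1}{p^*} - \frac{1}{q}\right]_+} \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{h \in \mathcal{N}^{d-2,H,H}} \frac{\gamma}{\gamma_{p,q}(h)} \sup_w \frac{1}{\|w\|_p} \left| \sum_{i=1}^{m} \xi_i \left[ w^\top [h(x_i)]_+ \right]_+ \right| \right] \qquad (13)$$

$$= H^{\left[\frac{1}{p^*} - \frac{1}{q}\right]_+} \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{g \in \mathcal{N}^{d-1,H}_{\gamma_{p,q} \le \gamma}} \left| \sum_{i=1}^{m} \xi_i [g(x_i)]_+ \right| \right]$$

$$\le 2 H^{\left[\frac{1}{p^*} - \frac{1}{q}\right]_+} \mathbb{E}_\xi \left[ \frac{1}{m} \sup_{g \in \mathcal{N}^{d-1,H}_{\gamma_{p,q} \le \gamma}} \left| \sum_{i=1}^{m} \xi_i g(x_i) \right| \right] \qquad (14)$$

where the equality (13) is obtained by Lemma 17 and inequality (14) is by the Contraction
Lemma. This gives us the bound on the Rademacher complexity of $\mathcal{N}^{d,H}_{\gamma_{p,q} \le \gamma}$ based on the
Rademacher complexity of $\mathcal{N}^{d-1,H}_{\gamma_{p,q} \le \gamma}$. Applying the same argument on all layers and using
Lemma 15 to bound the complexity of the first layer completes the proof. ■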
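
To see how the first bound of Theorem 1 behaves, it can be evaluated directly. The sketch below is an illustrative calculator, not part of the paper: the function name and argument names are hypothetical, the natural logarithm is assumed, and the constant $2\log(2D)$ is taken as stated in the theorem above. It makes two features visible: the width $H$ drops out whenever $q \le p^*$, and each extra layer multiplies the bound by $2 H^{[1/p^* - 1/q]_+}$.

```python
import math

def theorem1_bound(gamma, d, H, p, q, D, m, max_norm):
    # sqrt(gamma^2 * (2 H^{[1/p* - 1/q]_+})^{2(d-1)} * min{p*, 2 log(2D)} * max_norm^2 / m)
    # max_norm stands for sup_i ||x_i||_{p*}; natural log is an assumption.
    p_star = math.inf if p == 1 else p / (p - 1)
    inv_p_star = 0.0 if math.isinf(p_star) else 1.0 / p_star
    width_exp = max(inv_p_star - 1.0 / q, 0.0)
    depth_factor = (2.0 * H ** width_exp) ** (2 * (d - 1))
    first_layer = min(p_star, 2.0 * math.log(2 * D))
    return math.sqrt(gamma ** 2 * depth_factor * first_layer * max_norm ** 2 / m)

if __name__ == "__main__":
    for depth in (2, 3, 4):
        print(depth, theorem1_bound(1.0, depth, 100, 2, 2, 5, 1000, 1.0))
```

With $p = q = 2$ the printed values double with each added layer and do not change when `H` is varied.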


A.3. Proof of Lemma 17 

Proof It is immediate that the right hand side of the equality in the statement is always
less than or equal to the left hand side because given any vector $w$ in the right hand side,
by building the rows of the matrix $W$ in the left hand side from $w$ we attain the same value.
Therefore, it is enough to prove that the left hand side is less than or equal to the right hand
side. For convenience of notation, let $g(w) = \left| \sum_{i=1}^{m} \xi_i \left[ w^\top [f(x_i)]_+ \right]_+ \right|$. Define $\tilde{w}$ to be:

$$\tilde{w} = \arg\max_w \frac{g(w)}{\|w\|_p}$$

If $q \le p^*$, then the right hand side of the equality in the lemma statement reduces to
$g(\tilde{w}) / \|\tilde{w}\|_p$ and therefore we need to show that for any matrix $V$,

$$\frac{g(\tilde{w})}{\|\tilde{w}\|_p} \ge \frac{\|g(V)\|_{p^*}}{\|V\|_{p,q}}$$

Since $q \le p^*$, we have $\|V\|_{p,p^*} \le \|V\|_{p,q}$ and hence it is enough to prove the following
inequality:

$$\frac{g(\tilde{w})}{\|\tilde{w}\|_p} \ge \frac{\|g(V)\|_{p^*}}{\|V\|_{p,p^*}}$$

On the other hand, if $q > p^*$, then we need to prove that the following inequality holds:

$$H^{\frac{1}{p^*} - \frac{1}{q}} \frac{g(\tilde{w})}{\|\tilde{w}\|_p} \ge \frac{\|g(V)\|_{p^*}}{\|V\|_{p,q}}$$

Since $q > p^*$, we have that $\|V\|_{p,p^*} \le H^{\frac{1}{p^*} - \frac{1}{q}} \|V\|_{p,q}$. Therefore, it is again enough to
show that:

$$\frac{g(\tilde{w})}{\|\tilde{w}\|_p} \ge \frac{\|g(V)\|_{p^*}}{\|V\|_{p,p^*}}$$

We can rewrite the above inequality in the following form:

$$\sum_{i=1}^{H} \left( \frac{g(\tilde{w}) \|V_i\|_p}{\|\tilde{w}\|_p} \right)^{p^*} \ge \sum_{i=1}^{H} g(V_i)^{p^*}$$

By the definition of $\tilde{w}$, we know that the above inequality holds for each term in the sum
and hence the inequality is true. ■


A.4. Theorem 12 


The proof is similar to the proof of Theorem 1 but here bounding $\mu_{1,\infty}$ by $\mu$ means the
$\ell_1$ norm of input weights to each neuron is bounded by $\mu$. We use a different version of the
Contraction Lemma in the proof that is without the absolute value:

Lemma 18 (Contraction Lemma (without the absolute value)) Let function $\phi : \mathbb{R} \to \mathbb{R}$ be
Lipschitz with constant $L_\phi$. Then for any class $\mathcal{F}$ of functions mapping from $\mathcal{X}$ to $\mathbb{R}$ and
any set $S = \{x_1, \ldots, x_m\}$:

$$\mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \xi_i \phi(f(x_i)) \right] \le L_\phi\, \mathbb{E}_{\xi \in \{\pm 1\}^m} \left[ \frac{1}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \xi_i f(x_i) \right]$$

Theorem 12 For any anti-symmetric 1-Lipschitz function $\sigma$ and any set $S = \{x_1, \ldots, x_m\} \subseteq
\mathbb{R}^D$:

$$\mathcal{R}_m\left(\mathcal{N}^{d,\sigma}_{\mu_{1,\infty} \le \mu}\right) \le \sqrt{\frac{2 \mu^{2d} \log(2D) \sup_i \|x_i\|_\infty^2}{m}}$$

Proof Assuming $\xi$ is uniform over $\{\pm 1\}^m$, we have:

(15)

(16)

where the equality (15) is by the anti-symmetric property of $\sigma$ and inequality (16) is by the
version of the Contraction Lemma without the absolute value. This gives us the bound
on the Rademacher complexity of $\mathcal{N}^{d,\sigma}_{\mu_{1,\infty} \le \mu}$ based on the Rademacher complexity of $\mathcal{N}^{d-1,\sigma}_{\mu_{1,\infty} \le \mu}$.

Applying the same argument on all layers and using Lemma 15 to bound the complexity of
the first layer completes the proof. ■
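Appendix B. Proof that $\gamma^d_{p,q}(f)$ is a semi-norm in $\mathcal{N}^d$

We repeat the statement here for convenience.

Theorem 4 For any $d, p, q \ge 1$ such that $\frac{1}{q} \le \frac{1}{d-1}\left(1 - \frac{1}{p}\right)$, $\gamma^d_{p,q}(f)$ is a semi-norm in $\mathcal{N}^d$.

Proof The proof consists of three parts. First we show that the level set $\mathcal{N}^d_{\gamma_{p,q} \le \gamma} = \{f \in
\mathcal{N}^d : \gamma^d_{p,q}(f) \le \gamma\}$ is a convex set if the condition on $d, p, q$ is satisfied. Next, we establish
the non-negative homogeneity of $\gamma^d_{p,q}(f)$. Finally, we show that if a function $\alpha : \mathcal{N}^d \to \mathbb{R}$
is non-negative homogeneous and every sublevel set $\{f \in \mathcal{N}^d : \alpha(f) \le \gamma\}$ is convex, then
$\alpha$ satisfies the triangular inequality.

Convexity of the level sets First we show that for any two functions $f_1, f_2 \in \mathcal{N}^d_{\gamma_{p,q} \le \gamma}$ and
$0 \le \alpha \le 1$, the function $g = \alpha f_1 + (1 - \alpha) f_2$ is in the hypothesis class $\mathcal{N}^d_{\gamma_{p,q} \le \gamma}$. We prove
this by constructing weights $W$ that realize $g$. Let $U$ and $V$ be the weights of two neural
networks such that $\gamma_{p,q}(U) = \gamma^d_{p,q}(f_1) \le \gamma$ and $\gamma_{p,q}(V) = \gamma^d_{p,q}(f_2) \le \gamma$. For every layer
$i = 1, \ldots, d$ let

$$\tilde{U}_i = \sqrt[d]{\gamma_{p,q}(U)}\, \frac{U_i}{\|U_i\|_{p,q}}, \qquad \tilde{V}_i = \sqrt[d]{\gamma_{p,q}(V)}\, \frac{V_i}{\|V_i\|_{p,q}}$$

and set $W_1 = \begin{bmatrix} \tilde{U}_1 \\ \tilde{V}_1 \end{bmatrix}$ for the first layer, $W_i = \begin{bmatrix} \tilde{U}_i & 0 \\ 0 & \tilde{V}_i \end{bmatrix}$ for the intermediate layers and
$W_d = \begin{bmatrix} \alpha \tilde{U}_d & (1 - \alpha) \tilde{V}_d \end{bmatrix}$ for the output layer.

Then for the defined $W$, we have $f_W = \alpha f_1 + (1 - \alpha) f_2$ for rectified linear and any
other non-negative homogeneous activation function. Moreover, for any $i < d$, the norm
of each layer is

$$\|W_i\|_{p,q} = \left( \gamma_{p,q}(U)^{\frac{q}{d}} + \gamma_{p,q}(V)^{\frac{q}{d}} \right)^{\frac{1}{q}} \le 2^{\frac{1}{q}} \gamma^{1/d} \qquad (17)$$

and in layer $d$ we have:

$$\|W_d\|_p = \left( \alpha^p \gamma_{p,q}(U)^{\frac{p}{d}} + (1 - \alpha)^p \gamma_{p,q}(V)^{\frac{p}{d}} \right)^{\frac{1}{p}} \le 2^{\frac{1}{p} - 1} \gamma^{1/d} \qquad (18)$$

Combining inequalities (17) and (18), we get $\gamma^d_{p,q}(f_W) \le 2^{\frac{d-1}{q} + \frac{1}{p} - 1} \gamma \le \gamma$, where the last
inequality holds because we assume that $\frac{1}{q} \le \frac{1}{d-1}\left(1 - \frac{1}{p}\right)$. Thus for every $\gamma \ge 0$, $\mathcal{N}^d_{\gamma_{p,q} \le \gamma}$
is a convex set.

Non-negative homogeneity For any function $f \in \mathcal{N}^d$ and any $\alpha \ge 0$, let $U$ be the
weights realizing $f$ with $\gamma^d_{p,q}(f) = \gamma_{p,q}(U)$. Then $\sqrt[d]{\alpha}\, U$ realizes $\alpha f$, establishing $\gamma^d_{p,q}(\alpha f) \le
\gamma_{p,q}(\sqrt[d]{\alpha}\, U) = \alpha \gamma_{p,q}(U) = \alpha \gamma^d_{p,q}(f)$. This establishes the non-negative
homogeneity of $\gamma^d_{p,q}$.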
Convex sublevel sets and homogeneity imply triangular inequality Let $\alpha(f)$ be non-negative
homogeneous and assume that every sublevel set $\{f \in \mathcal{N}^d : \alpha(f) \le \gamma\}$ is
convex. Then for $f_1, f_2 \in \mathcal{N}^d$, defining $\gamma_1 = \alpha(f_1)$, $\gamma_2 = \alpha(f_2)$, $\tilde{f}_1 = (\gamma_1 + \gamma_2) f_1 / \gamma_1$, and
$\tilde{f}_2 = (\gamma_1 + \gamma_2) f_2 / \gamma_2$, we have

$$\alpha(f_1 + f_2) = \alpha\left( \frac{\gamma_1}{\gamma_1 + \gamma_2} \tilde{f}_1 + \frac{\gamma_2}{\gamma_1 + \gamma_2} \tilde{f}_2 \right) \le \gamma_1 + \gamma_2 = \alpha(f_1) + \alpha(f_2).$$

Here the inequality is due to the convexity of the level set and the fact that $\alpha(\tilde{f}_1) = \alpha(\tilde{f}_2) =
\gamma_1 + \gamma_2$, because of the homogeneity. Therefore $\alpha$ satisfies the triangular inequality and thus
it is a seminorm. ■
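
The block construction used in the convexity argument (stack the first layers, place intermediate layers on a block diagonal, mix the output weights) can be sketched in a few lines. This is a minimal illustration assuming depth $d \ge 2$, RELU hidden units and a single linear output unit; it omits the per-layer rescaling of $U_i$ and $V_i$, which affects the norms but not the computed function. All names are illustrative.

```python
def relu_vec(v):
    return [max(a, 0.0) for a in v]

def matvec(W, x):
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def forward(layers, x):
    # layers[k] is a weight matrix (list of rows); the last one has a single row
    h = x
    for W in layers[:-1]:
        h = relu_vec(matvec(W, h))
    return matvec(layers[-1], h)[0]

def block_diag(A, B):
    wa, wb = len(A[0]), len(B[0])
    return [list(r) + [0.0] * wb for r in A] + [[0.0] * wa + list(r) for r in B]

def combine(U, V, alpha):
    # Network computing alpha * f_U + (1 - alpha) * f_V; requires len(U) == len(V) >= 2
    d = len(U)
    W = [[list(r) for r in U[0]] + [list(r) for r in V[0]]]      # first layer: stack rows
    W += [block_diag(U[k], V[k]) for k in range(1, d - 1)]      # intermediate: block diagonal
    W.append([[alpha * a for a in U[-1][0]] +
              [(1 - alpha) * b for b in V[-1][0]]])             # output: mixed weights
    return W

if __name__ == "__main__":
    U = [[[1.0, -1.0], [0.5, 2.0]], [[1.0, 1.0], [-1.0, 2.0]], [[1.0, -2.0]]]
    V = [[[0.2, 0.7], [-1.0, 0.3]], [[0.5, -0.5], [1.5, 1.0]], [[-1.0, 0.5]]]
    x = [0.3, -1.2]
    print(forward(combine(U, V, 0.3), x), 0.3 * forward(U, x) + 0.7 * forward(V, x))
```

The two printed numbers agree because the two sub-networks never interact before the output unit.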


Appendix C. Path Regularization 

C.1. Theorem 5

Lemma 19 For any function $f \in \mathcal{N}^{d,H}_{\gamma_{p,\infty} \le \gamma}$ there is a layered network with weights $\tilde{w}$ such
that $\gamma_{p,\infty}(\tilde{w}) = \gamma^{d,H}_{p,\infty}(f)$ and for any internal unit $v$, $\sum_{(u \to v) \in E} |\tilde{w}(u \to v)|^p = 1$.

Proof Let $w$ be the weights of a network such that $\gamma_{p,\infty}(w) = \gamma^{d,H}_{p,\infty}(f)$. We now construct
a network with weights $\tilde{w}$ such that $\gamma_{p,\infty}(\tilde{w}) = \gamma^{d,H}_{p,\infty}(f)$ and for any internal unit $v$,
$\sum_{(u \to v) \in E} |\tilde{w}(u \to v)|^p = 1$. We do this by an incremental algorithm. Let $w_0 = w$. At
each step $i$, we do the following.

Consider the first layer. Set $V_k$ to be the set of neurons in layer $k$. Let $x$ be the
maximum of the $\ell_p$ norms of input weights to each neuron in set $V_1$ and let $U_x \subseteq V_1$ be the set
of neurons whose $\ell_p$ norm of input weights is exactly $x$. Now let $y$ be the maximum
of the $\ell_p$ norms of input weights to each neuron in the set $V_1 \setminus U_x$ and let $U_y$ be the set of
the neurons such that the $\ell_p$ norm of their input weights is exactly $y$. Clearly $y < x$.
We now scale down the input weights of neurons in set $U_x$ by $y/x$ and scale up all the
outgoing edges of vertices in $U_x$ by $x/y$ ($y$ cannot be zero for internal neurons based on
the definition). It is straightforward that the new network realizes the same function and
the $\ell_{p,\infty}$ norm of the first layer has changed by a factor $y/x$. Now for every neuron $v \in V_2$,
let $r(v)$ be the $\ell_p$ norm of the new incoming weights divided by the $\ell_p$ norm of the original
incoming weights. We know that $r(v) \le x/y$. We again scale down the input weights of
every $v \in V_2$ by $1/r(v)$ and scale up all the outgoing edges of $v$ by $r(v)$. Continuing this
operation on each layer, each time we propagate the ratio to the next layer while the
network always realizes the same function and for each layer $k$, we know that for every
$v \in V_k$, $r(v) \le x/y$. After this operation, in the network, the $\ell_{p,\infty}$ norm of the first layer is
scaled down by $y/x$ while the $\ell_{p,\infty}$ norm of the last layer is scaled up by at most $x/y$ and
the $\ell_{p,\infty}$ norm of the rest of the layers has remained the same. Therefore, if $w_i$ is the new
weight setting, we have $\gamma_{p,\infty}(w_i) \le \gamma_{p,\infty}(w_{i-1})$.

After continuing the above step at most $|V_1| - 1$ times, the $\ell_p$ norm of input weights is
the same for all neurons in $V_1$. We can then run the same algorithm on the other layers and at
the end we have a network with weight setting $\tilde{w}$ such that for each $k < d$, the $\ell_p$ norms of the
input weights to the neurons in layer $k$ are equal to each other and $\gamma_{p,\infty}(\tilde{w}) \le \gamma_{p,\infty}(w)$.
This is in fact an equality because weight setting $\tilde{w}$ realizes function $f$ and we know that
$\gamma_{p,\infty}(w) = \gamma^{d,H}_{p,\infty}(f)$. A simple scaling of weights in layers completes the proof. ■


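Theorem 5 For $p \ge 1$, any $d$ and (finite or infinite) $H$, for any $f \in \mathcal{N}^{d,H}$: $\phi^{d,H}_p(f) = \gamma^{d,H}_{p,\infty}(f)$.

Proof By Lemma 19, there is a layered network with weights $\tilde{w}$ such that $\gamma_{p,\infty}(\tilde{w}) =
\gamma^{d,H}_{p,\infty}(f)$ and for any internal unit $v$, $\sum_{(u \to v) \in E} |\tilde{w}(u \to v)|^p = 1$. Let $W$ be the weights of
the layered network that corresponds to $\tilde{w}$. Then we have:

$$\phi_p(\tilde{w}) = \left( \sum_{v_{in}[i] \xrightarrow{e_1} \cdots \xrightarrow{e_d} v_{out}} \prod_{k=1}^{d} |\tilde{w}(e_k)|^p \right)^{\frac{1}{p}} \qquad (19)$$

$$= \left( \sum_{i_{d-1}=1}^{H} \cdots \sum_{i_1=1}^{H} \sum_{i_0=1}^{D} |W_d[i_{d-1}]|^p \prod_{k=1}^{d-1} |W_k[i_k, i_{k-1}]|^p \right)^{\frac{1}{p}} \qquad (20)$$

$$= \left( \sum_{i_{d-1}=1}^{H} \cdots \sum_{i_1=1}^{H} |W_d[i_{d-1}]|^p \prod_{k=2}^{d-1} |W_k[i_k, i_{k-1}]|^p \sum_{i_0=1}^{D} |W_1[i_1, i_0]|^p \right)^{\frac{1}{p}} \qquad (21)$$

$$= \left( \sum_{i_{d-1}=1}^{H} \cdots \sum_{i_1=1}^{H} |W_d[i_{d-1}]|^p \prod_{k=2}^{d-1} |W_k[i_k, i_{k-1}]|^p \right)^{\frac{1}{p}} \qquad (22)$$

$$= \left( \sum_{i_{d-1}=1}^{H} \cdots \sum_{i_2=1}^{H} |W_d[i_{d-1}]|^p \prod_{k=3}^{d-1} |W_k[i_k, i_{k-1}]|^p \sum_{i_1=1}^{H} |W_2[i_2, i_1]|^p \right)^{\frac{1}{p}} \qquad (23)$$

$$= \left( \sum_{i_{d-1}=1}^{H} \cdots \sum_{i_2=1}^{H} |W_d[i_{d-1}]|^p \prod_{k=3}^{d-1} |W_k[i_k, i_{k-1}]|^p \right)^{\frac{1}{p}} \qquad (24)$$

$$= \left( \sum_{i_{d-1}=1}^{H} |W_d[i_{d-1}]|^p \right)^{\frac{1}{p}} = \ell_p(W_d) = \gamma_{p,\infty}(W) \qquad (25)$$

where steps (20) to (24) are due to the fact that the $\ell_p$ norm of input weights to each internal
neuron is exactly 1 and the last equality is again because the $\ell_{p,\infty}$ norm of all layers is exactly
1 except the layer $d$. ■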
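
The identity between the path regularizer and the product of per-unit norms can be illustrated on a small layered network. The sketch below computes $\phi_p$ by dynamic programming over layers rather than by enumerating paths, and uses a direct forward normalization pass (each internal unit is rescaled to unit $\ell_p$ norm and the scale is pushed to its outgoing weights). This pass is a simplification assumed here for illustration, not the incremental procedure of Lemma 19, and the function names are hypothetical.

```python
def path_norm(layers, p):
    # phi_p(W): (sum over input-output paths of prod |w|^p)^(1/p), via dynamic programming
    s = [1.0] * len(layers[0][0])
    for W in layers:
        s = [sum(abs(a) ** p * b for a, b in zip(row, s)) for row in W]
    return s[0] ** (1.0 / p)

def gamma_p_inf(layers, p):
    # Product over layers of the largest l_p norm of a unit's incoming weights
    g = 1.0
    for W in layers:
        g *= max(sum(abs(a) ** p for a in row) ** (1.0 / p) for row in W)
    return g

def normalize(layers, p):
    # Rescale every internal unit to unit l_p norm, pushing the scale to the next layer
    L = [[list(r) for r in W] for W in layers]
    for k in range(len(L) - 1):
        for j, row in enumerate(L[k]):
            n = sum(abs(a) ** p for a in row) ** (1.0 / p)
            if n == 0.0:
                continue  # unit with no incoming weight; left untouched
            L[k][j] = [a / n for a in row]
            for out_row in L[k + 1]:
                out_row[j] *= n
    return L

if __name__ == "__main__":
    layers = [[[1.0, -2.0], [0.5, 0.5]], [[1.0, 1.0], [2.0, -1.0]], [[1.0, -3.0]]]
    norm = normalize(layers, 2)
    print(path_norm(layers, 2), gamma_p_inf(layers, 2), gamma_p_inf(norm, 2))
```

Normalization leaves the product of weights along every path unchanged, so the path norm is preserved, while $\gamma_{p,\infty}$ of the normalized weights drops to exactly that value.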
C.2. Proof of Theorem 6

In this section, without loss of generality, we assume that all the internal nodes in a DAG
have incoming edges and outgoing edges because otherwise we can just discard them. Let
$d_{out}(v)$ be the length of the longest directed path from vertex $v$ to $v_{out}$ and $d_{in}(v)$ be the length
of the longest directed path from any input vertex $v_{in}[i]$ to $v$. We say graph $G$ is a sublayered
graph if $G$ is a subgraph of a layered graph.

We first show the necessary and sufficient conditions under which a DAG is a sublayered
graph.

Lemma 20 The graph $G(E, V)$ is a sublayered graph if and only if any path from input
nodes to the output nodes has length $d$ where $d$ is the length of the longest path in $G$.

Proof Since the internal nodes have incoming edges and outgoing edges, if $G$ is
a sublayered graph it is straightforward by induction on the layers that for every vertex $v$
in layer $i$, there is a vertex $u$ in layer $i + 1$ such that $(v \to u) \in E$ and this proves the
necessary condition for being a sublayered graph.

To show the sufficient condition, for any internal node $u$, $u$ has distance $d_{in}(u)$ from
the input nodes in every path that includes $u$ (otherwise we can build a path that is longer
than $d$). Therefore, for each vertex $v \in V$, we can place vertex $v$ in layer $d_{in}(v)$ and all the
outgoing edges from $v$ will be to layer $d_{in}(v) + 1$. ■

Lemma 21 If the graph $G(E, V)$ is not a sublayered graph then there exists a directed
edge $(u \to v)$ such that $d_{in}(u) + d_{out}(v) < d - 1$ where $d$ is the length of the longest path in
$G$.

Proof We prove the lemma by an inductive argument. If $G$ is not sublayered, by Lemma 20,
we know that there exists a path $v_0 \to \cdots v_i \cdots \to v_{d'}$ where $v_0$ is an input node ($d_{in}(v_0) =
0$), $v_{d'} = v_{out}$ ($d_{out}(v_{d'}) = 0$) and $d' < d$. Now consider the vertex $v_1$. We need to have
$d_{out}(v_1) = d - 1$: otherwise, if $d_{out}(v_1) < d - 1$ we get $d_{in}(u) + d_{out}(v) < d - 1$ for the edge
$(v_0 \to v_1)$, and if $d_{out}(v_1) > d - 1$ there will be a path in $G$ that is longer than $d$. Also, since
$d_{out}(v_1) = d - 1$ and the longest path in $G$ has length $d$, we have $d_{in}(v_1) = 1$.

By applying the same inductive argument on each vertex $v_i$ in the path we get $d_{in}(v_i) = i$
and $d_{out}(v_i) = d - i$. Note that if the condition $d_{in}(u) + d_{out}(v) < d - 1$ is satisfied
at one of the steps of the inductive argument, the lemma is proved. Otherwise, we have
$d_{in}(v_{d'-1}) = d' - 1$ and $d_{out}(v_{d'-1}) = d - d' + 1$ and therefore $d_{in}(v_{d'-1}) + d_{out}(v_{out}) =
d' - 1 < d - 1$, which proves the lemma. ■

Theorem 6 For any $p \ge 1$ and any $d$: $\gamma^d_{p,\infty}(f) = \min_{G \in \mathrm{DAG}(d)} \phi^G_p(f)$.

Proof Consider any $f_{G,w} \in \mathcal{N}^{\mathrm{DAG}(d)}$ where the graph $G(E, V)$ is not sublayered. Let $\rho$ be
the total number of paths from input nodes to the output nodes. Let $T$ be the sum over paths of
the length of the path. We indicate an algorithm to change $G$ into a sublayered graph $\tilde{G}$ of
depth $d$ with weights $\tilde{w}$ such that $f_{G,w} = f_{\tilde{G},\tilde{w}}$ and $\phi(w) = \phi(\tilde{w})$. Let $G_0 = G$ and $w_0 = w$.

At each step $i$, we consider the graph $G_{i-1}$. If $G_{i-1}$ is sublayered, we are done; otherwise
by Lemma 21, there exists an edge $(u \to v)$ such that $d_{in}(u) + d_{out}(v) < d - 1$. Now we
add a new vertex $\tilde{v}_i$ to graph $G_{i-1}$, remove the edge $(u \to v)$, add two edges $(u \to \tilde{v}_i)$
and $(\tilde{v}_i \to v)$ and return the graph as $G_i$. Since we had $d_{in}(u) + d_{out}(v) < d - 1$ in
$G_{i-1}$, the longest path in $G_i$ still has length $d$. We also set $w(u \to \tilde{v}_i) = \sqrt{|w(u \to v)|}$
and $w(\tilde{v}_i \to v) = \mathrm{sign}(w(u \to v)) \sqrt{|w(u \to v)|}$. Since we are using rectified linear unit
activations, for any $x \ge 0$, we have $[x]_+ = x$ and therefore:

$$w(\tilde{v}_i \to v) \left[ w(u \to \tilde{v}_i)\, o(u) \right]_+ = \mathrm{sign}(w(u \to v)) \sqrt{|w(u \to v)|} \left[ \sqrt{|w(u \to v)|}\, o(u) \right]_+$$

$$= \mathrm{sign}(w(u \to v)) \sqrt{|w(u \to v)|} \sqrt{|w(u \to v)|}\, o(u)$$

$$= w(u \to v)\, o(u)$$

So we conclude that $f_{G_i, w_i} = f_{G_{i-1}, w_{i-1}}$. Clearly, since the product of the weights along any
path from the input vertices to the output vertex is unchanged, we have $\phi(w) = \phi(\tilde{w})$. Let $T_i$ be
the sum over paths of the length of the path in $G_i$. It is clear that $T_{i-1} < T_i$ because we add a
new edge into a path at each step. We also know by Lemma 20 that if $T_i = \rho d$, then $G_i$ is a
sublayered graph. Therefore, after at most $\rho d - T_0$ steps, we return a sublayered graph $\tilde{G}$ and
weights $\tilde{w}$ such that $f_{G,w} = f_{\tilde{G},\tilde{w}}$. We can easily turn the sublayered graph $\tilde{G}$ into a layered
graph by adding edges with zero weights and this together with Theorem 5 completes the proof. ■
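
The edge-splitting step of this proof is easy to check in isolation. The sketch below assumes the unit $u$ is itself a RELU unit, so that its output $o(u)$ is non-negative; `split_edge` is an illustrative name. It replaces one weight by two weights through a new RELU unit and confirms that both the contribution to $v$ and the product of weights along the path are preserved.

```python
import math

def relu(z):
    return max(z, 0.0)

def split_edge(w):
    # Replace edge weight w by (u -> v_new) and (v_new -> v) weights
    r = math.sqrt(abs(w))
    return r, math.copysign(r, w)

if __name__ == "__main__":
    for w in (-2.5, 0.7):
        a, b = split_edge(w)
        o_u = 1.3  # non-negative output of a RELU unit u (assumption)
        print(w * o_u, b * relu(a * o_u))
```

The non-negativity assumption matters: for a negative $o(u)$ the inner RELU would clip the signal and the two expressions would differ.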


Appendix D. Hardness of Learning Neural Networks 

Daniely et al. (2014) show in Theorem 5.4 and in Section 7.2 that subject to the strong
random CSP assumption, for any $k = \omega(1)$ the hypothesis class of intersection of $k$ homogeneous
halfspaces over $\{\pm 1\}^D$ with normals in $\{\pm 1\}^D$ is not efficiently PAC learnable (even
improperly) 2. Furthermore, for any $\epsilon > 0$, Klivans and Sherstov (2006) prove this hardness
result subject to intractability of the $\tilde{O}(D^{1.5})$-unique shortest vector problem for $k = D^{\epsilon}$.

If it is not possible to efficiently PAC learn intersection of halfspaces (even improperly),
we can conclude it is also not possible to efficiently PAC learn any hypothesis class which
can represent such an intersection. In Theorem 22 we show that intersection of homogeneous
halfspaces can be realized with unit margin by neural networks with bounded norm.

Theorem 22 For any $k > 0$, the intersection of $k$ homogeneous halfspaces is realizable
with unit margin by $\mathcal{N}^2_{\gamma_{p,q} \le \gamma}$ where $\gamma = 4 D^{\frac{1}{p}} k^2$.

2. Their Theorem 5.4 talks about unrestricted halfspaces, but the construction in Section 7.2 uses only data
in $\{\pm 1\}^D$ and halfspaces specified by $\langle w, x \rangle > 0$ with $w \in \{\pm 1\}^D$.
Proof The proof is by a construction that is similar to the one in Livni et al. (2014). For
each hyperplane $\langle w_i, x \rangle > 0$, where $w_i \in \{\pm 1\}^D$, we include two units in the first layer:
$g_i^+(x) = [\langle w_i, x \rangle]_+$ and $g_i^-(x) = [\langle w_i, x \rangle - 1]_+$. We set all incoming weights of the output
node to have magnitude 1. Therefore, this network is realizing the following function:

$$f(x) = \sum_{i=1}^{k} \left( [\langle w_i, x \rangle]_+ - [\langle w_i, x \rangle - 1]_+ \right)$$

Since all inputs and all weights are integer, the outputs of the first layer will be integer,
$\left( [\langle w_i, x \rangle]_+ - [\langle w_i, x \rangle - 1]_+ \right)$ will be zero or one, and $f$ realizes the intersection of the $k$
halfspaces with unit margin. Now, we just need to make sure that $\gamma^2_{p,q}(f)$ is bounded by
$\gamma = 4 D^{\frac{1}{p}} k^2$:

$$\gamma^2_{p,q}(f) = D^{\frac{1}{p}} (2k)^{\frac{1}{q}} (2k)^{\frac{1}{p}} \le D^{\frac{1}{p}} (2k)^2 = \gamma.$$
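
The construction can be checked exhaustively on small inputs. The sketch below evaluates $f$ on every point of $\{\pm 1\}^D$ for a few hypothetical normals and confirms that each summand is zero or one, so $f(x) = k$ exactly on the intersection and $f(x) \le k - 1$ elsewhere. The function names and the example normals are illustrative.

```python
from itertools import product

def relu(z):
    return max(z, 0)

def network(ws, x):
    # f(x) = sum_i ([<w_i, x>]_+ - [<w_i, x> - 1]_+)
    total = 0
    for w in ws:
        s = sum(a * b for a, b in zip(w, x))
        total += relu(s) - relu(s - 1)
    return total

def in_intersection(ws, x):
    # True when x satisfies <w_i, x> > 0 for every halfspace
    return all(sum(a * b for a, b in zip(w, x)) > 0 for w in ws)

if __name__ == "__main__":
    ws = [(1, 1, 1), (1, -1, 1)]  # example normals in {+1, -1}^3 (assumption)
    for x in product((-1, 1), repeat=3):
        print(x, network(ws, x), in_intersection(ws, x))
```

Thresholding $f$ at $k - \tfrac{1}{2}$ then recovers the intersection with unit margin between the two cases.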
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